Overconfidence from early transformative AIs is a neglected, tractable, and existential problem.
If early transformative AIs are overconfident, then they might build ASI/other dangerous technology or come up with new institutions that seem safe/good but end up being disastrous.
This problem seems fairly neglected and not addressed by many existing agendas (i.e., the AI doesn’t need to be intent-misaligned to be overconfident).[1]
Overconfidence also feels like a very “natural” trait for the AI to end up having relative to the pre-training prior, compared to something like a fully deceptive schemer.
My current favorite method to address overconfidence is training truth-seeking/scientist AIs. I think using forecasting as a benchmark seems reasonable (see e.g., FRI’s work here), but I don’t think we’ll have enough data to really train against it. Also I’m worried that “being good forecasters” doesn’t generalize to “being well calibrated about your own work.”
On some level this should not be too hard because pretraining should already teach the model to be well calibrated on a per-token level (see e.g., this SPAR poster). We’ll just have to elicit this more generally.
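As a toy illustration of what measuring this could look like (the code and numbers below are a sketch I'm making up for illustration, not from the linked work): collect statements the model assigned probabilities to, along with how they resolved, and compute an expected calibration error.

```python
import numpy as np

def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Bin predictions by stated confidence and compare each bin's
    average confidence with its empirical accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # put p == 1.0 into the top bin
        mask = (confidences >= lo) & ((confidences < hi) | (hi == 1.0))
        if not mask.any():
            continue
        gap = abs(confidences[mask].mean() - outcomes[mask].mean())
        ece += (mask.sum() / len(confidences)) * gap
    return float(ece)

# Toy example: a model that says "90%" on claims that resolve true only half the time.
stated = np.array([0.9, 0.9, 0.9, 0.9, 0.6, 0.6])
resolved = np.array([1, 0, 1, 0, 1, 0])
print(expected_calibration_error(stated, resolved))  # ~0.30, i.e. systematically overconfident
```

The same metric works whether the "predictions" are explicit forecasts or per-token probabilities treated as predictions about the next token.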
(I hope to flesh this point out more in a full post sometime, but it felt concrete enough to be worth quickly posting now. I am fairly confident in the core claims here.)
Edit: Meta note about reception of this shortform
This has generated a lot more discussion than I expected! When I wrote it up, I mostly felt like this is a good enough idea & I should put it on people’s radar. Right now there are 20 agreement votes with a net agreement score of −2 (haha!). I think this means that this is a good topic to flesh out more in a future post titled “I [still/no longer] think overconfident AIs are a big problem.” I feel like the commenters below have given me a lot of good feedback to chew on.
More broadly though, LessWrong is one of the only places where anyone could post ideas like this and get high-quality feedback and discussion on this topic. I’m very grateful to the Lightcone team for giving us this platform and feel very vindicated in my donation.
Edit 2: Ok maybe the motte version of the statement is “We’re probably going to use early transformative AI to build ASI, and if ETAI doesn’t know that it doesn’t know what it’s doing (i.e., it’s overconfident in its ability to align ASI), we’re screwed.”
For example, you might not necessarily detect overconfidence in these AIs even with strong interpretability because the AI doesn’t “know” that it’s overconfident. I also don’t think there are obvious low/high stakes control methods that can be applied here.
No. The kind of intelligent agent that is scary is the kind that would notice its own overconfidence—after some small number of experiences being overconfident—and then work out how to correct for it.
There are more stable epistemic problems that are worth thinking about, but this definitely isn’t one of them.
Trying to address minor capability problems in hypothetical stupid AIs is irrelevant to x-risk.
There can be more than one kind of scary intelligent agent.
Yes, but what’s your point? Are you saying that highly capable (ASI building, institution replacing) but extremely epistemically inefficient agents are plausible? Without the ability to learn from mistakes?
Are you saying that highly capable (ASI building, institution replacing) but extremely epistemically inefficient agents are plausible?
Yes.
Without the ability to learn from mistakes?
Without optimally learning from mistakes. If you look at the most successful humans, they’re largely not the most-calibrated ones. This isn’t because being well-calibrated is actively harmful, or even because it’s not useful past a certain point, but just because it’s not the only useful thing and so spending your “points” elsewhere can yield better results.
I do expect the first such agents would be able to notice their overconfidence. I don’t particularly expect that they would be able to fix that overconfidence without having their other abilities regress such that the “fix” was net harmful to them.
If you think there’s a strong first-mover advantage you should care a lot about what the minimum viable scary system looks like, rather than what scary systems at the limit look like.
You’re making a much stronger claim than that and retreating to a Motte. Of course it’s not optimal. Not noticing very easy-to-correct mistakes is extremely, surprisingly sub-optimal on a very specific axis. This shouldn’t be plausible when we condition on an otherwise low likelihood of making mistakes.
If you look at the most successful humans, they’re largely not the most-calibrated ones.
The most natural explanation for this is that it’s mostly selection effects, combined with humans being bad at prediction in general. And I expect most examples you could come up with are more like domain-specific overconfidence rather than across-the-board overconfidence.
but just because it’s not the only useful thing and so spending your “points” elsewhere can yield better results.
I agree calibration is less valuable than other measures of correctness. But there aren’t zero-sum “points” to be distributed here. Correcting for systematic overconfidence is basically free and doesn’t have tradeoffs. You just take whatever your confidence would be and adjust it down. It can be done on-the-fly, even easier if you have a scratchpad.
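For concreteness, a minimal sketch of the kind of adjustment I mean (the shrink factor here is made up; in practice you’d fit it on previously resolved predictions):

```python
import math

def shrink_confidence(p, temperature=2.0):
    """Pull a probability toward 0.5 in log-odds space.

    temperature > 1 means "adjust your confidence down"; it would be
    chosen based on how overconfident past predictions turned out to be.
    """
    p = min(max(p, 1e-6), 1 - 1e-6)          # avoid infinities at exactly 0 or 1
    logit = math.log(p / (1 - p))
    return 1 / (1 + math.exp(-logit / temperature))

print(shrink_confidence(0.95))  # ~0.81: same claim, stated less confidently
print(shrink_confidence(0.50))  # 0.5: already-uncertain claims are left alone
```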
If you think there’s a strong first-mover advantage you should care a lot about what the minimum viable scary system looks like, rather than what scary systems at the limit look like.
No, not when it comes to planning mitigations. See the last paragraph of my response to Tim.
JG: Are you saying that highly capable (ASI building, institution replacing) but extremely epistemically inefficient agents are plausible?
FS: Without optimally learning from mistakes
JG: You’re making a much stronger claim than that and retreating to a Motte. Of course it’s not optimal.
I don’t think I am retreating to a motte. The wiki page for “epistemic efficiency” defines it as
An agent that is “efficient”, relative to you, within a domain, is one that never makes a real error that you can systematically predict in advance.
Epistemic efficiency (relative to you): You cannot predict directional biases in the agent’s estimates (within a domain).
On any class of questions within any particular domain, I do expect there’s an algorithm the agent could follow to achieve epistemic efficiency on that class of questions. For example, let’s say the agent in question wants to improve its calibration at the following question
“Given a patient presents with crushing substernal chest pain radiating to the left arm, what is the probability that their troponin I will be >0.04 ng/mL?”
And not just this question, but every question of the form “Given patient presents with symptom X, what is the probability that pharmacological test Y will have result Z”. I expect it could do something along the lines of
Gather a bunch of historical ground truth data
Test itself on said ground truth data to determine what systematic biases it has on that class of question, and on any particular subset of those questions it cares to identify
Build a corrective model, where it can feed in a question and an estimate and get out an estimate that corrects for all the biases it identified in step 2
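A rough sketch of what steps 1–3 could look like, assuming the agent has logged (question category, stated probability, resolved outcome) triples from step 1 (all names here are illustrative placeholders):

```python
from collections import defaultdict

import numpy as np

def fit_correction_table(records, n_bins=10):
    """Step 2: for each (category, confidence bin), estimate how often the
    predicted outcome actually happened in the historical ground truth.
    `records` is an iterable of (category, stated_prob, outcome) triples."""
    buckets = defaultdict(list)
    for category, prob, outcome in records:
        bin_idx = min(int(prob * n_bins), n_bins - 1)
        buckets[(category, bin_idx)].append(outcome)
    return {key: float(np.mean(vals)) for key, vals in buckets.items()}

def corrected_estimate(table, category, prob, n_bins=10):
    """Step 3: replace a raw estimate with the empirical frequency observed
    for similar past estimates, falling back to the raw value if unseen."""
    bin_idx = min(int(prob * n_bins), n_bins - 1)
    return table.get((category, bin_idx), prob)

# Toy usage: historically, ~90% estimates on this class of question resolved true only 60% of the time.
history = [("lab-test-given-symptom", 0.9, o) for o in (1, 0, 1, 0, 1, 0, 1, 0, 1, 1)]
table = fit_correction_table(history)
print(corrected_estimate(table, "lab-test-given-symptom", 0.92))  # -> 0.6
```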
On our current trajectory, I expect the minimal viable scary agent will fail to be epistemically efficient relative to humans in the following cases
Ground truth data is expensive for the agent to obtain, relative to the cost for humans to obtain it. This is particularly likely to happen in domains where the agent’s perception lags behind that of humans (e.g. some domain where visual-spatial reasoning is required to access the ground truth).
Domains where humans can identify subcategories of question that the agent fails to identify due to having worse-than-human sample efficiency (e.g. humans can throw a bunch of data into an animated heatmap and quite quickly identify areas that are “interesting”, and the ability of AI assistants to build high-quality, informative, high-bandwidth visualizations seems to be increasing much faster than the ability of AI agents to understand those visualizations)
Domains that the agent could have calibrated itself on, but where it didn’t actively choose to spend the resources to do so. I expect this will be true of most domains, but mostly noticed in a few specific domains where some question the agent has never put very much thought into in the past suddenly becomes very relevant to a lot of topics at once because the world changed.
See the last paragraph of my response to Tim
I assume you’re talking about this one?
Trying to address minor capability problems in hypothetical stupid AIs is irrelevant to x-risk.
I think Tim is talking about addressing this problem in actual stupid AIs, not hypothetical ones. Our current systems (which would have been called AGI before we gerrymandered the definition to exclude them) do exhibit this failure mode, and this significantly reduces the quality of their risk assessments. As those systems are deployed more widely and grow more capable, the risk introduced by them being bad at risk assessment will increase. I don’t see any reason this dynamic won’t scale all the way up to existential risk.
Aside: I would be very interested to hear arguments as to why this dynamic won’t scale up to existential risk as agents become capable of taking actions that would lead to the end of industrial civilization or the extinction of life on Earth. I expect such arguments would take the form “as AI agents get more capable, we should expect they will get better at reducing the probability of their actions having severe unintended consequences faster than their ability to do actions which could have severe unintended consequences will increase, because <your argument here>”. One particular concrete action I’m interested in is “ASI-building”—an AI agent that is both capable of building an ASI and confidently wrong that building an ASI would accomplish its goals seems really bad.
Anyway, my point is not that the minimal viable scary agent is the only kind of scary agent. My point is that
The minimal viable scary agent is in fact scary.
It doesn’t need to be superhuman at everything to be scary
It is worth investing more than zero resources into mitigating the risks we expect to see with the first scary agents
This is true even if we don’t expect those mitigations to scale all the way up to superhuman-at-literally-all-tasks ASI.
My read was:
JG: Without ability to learn from mistakes
FS: Without optimal learning from mistakes
But this was misdirection, we are arguing about how surprised we should be when a competent agent doesn’t learn a very simple lesson after making the mistake several times. Optimality is misdirection, the thing you’re defending is extreme sub-optimality and the thing I’m arguing for is human-level ability-to-correct-mistakes.
On our current trajectory, I expect the minimal viable scary agent will fail to be epistemically efficient relative to humans in the following cases
I agree that there are plausibly domains where a minimal viable scary agent won’t be epistemically efficient with respect to humans. I think you’re overconfident (lol) in drawing specific conclusions (i.e. that a specific simple mistake is likely) from this kind of reasoning about capable AIs, and that’s my main disagreement.
But engaging directly, all three of these seem not very relevant to the case of general overconfidence, because general overconfidence is noticeable and correctable from lots of types of experiment. A more plausible thing to expect is low quality predictions about low data domains, not general overconfidence across low and high data domains.
I assume you’re talking about this one?
No, I meant this one:
I don’t think the first AI smart enough to cause catastrophe will need to be that smart.
I think focusing on the “first AI smart enough” leads to a lot of low-EV research. If you solve a problem with the first AI smart enough, this doesn’t help much because a) there are presumably other AIs of similar capability, or soon will be, with somewhat different capability profiles and b) it won’t be long before there are more capable AIs and c) it’s hard to predict future capability profiles.
The minimal viable scary agent is in fact scary.
It doesn’t need to be superhuman at everything to be scary
It is worth investing more than zero resources into mitigating the risks we expect to see with the first scary agents
This is true even if we don’t expect those mitigations to scale all the way up to superhuman-at-literally-all-tasks ASI.
I agree with all of these, so it feels a little like you’re engaging with an imagined version of me who is pretty silly.
Trying to rephrase my main point, because I think this disagreement must be at least partially a miscommunication:
Humans like you and I have the ability to learn from mistakes after making them several times. Across-the-board overconfidence is a mistake that we wouldn’t have much trouble correcting in ourselves, if it were important.
Domain-specific overconfidence on domains with little feedback is not what I’m talking about, because it didn’t appear to be what Tim was talking about. I’m also not talking about bad predictions in general.
But this was misdirection, we are arguing about how surprised we should be when a competent agent doesn’t learn a very simple lesson after making the mistake several times. Optimality is misdirection, the thing you’re defending is extreme sub-optimality and the thing I’m arguing for is human-level ability-to-correct-mistakes.
I agree that this is the thing we’re arguing about. I do think there’s a reasonable chance that the first AIs which are capable of scary things[1] will have much worse sample efficiency than humans, and as such be much worse than humans at learning from their mistakes. Maybe 30%? Intervening on the propensity of AI agents to do dangerous things because they are overconfident in their model of why the dangerous thing is safe seems very high leverage in such worlds.
I think focusing on the “first AI smart enough” leads to a lot of low-EV research. If you solve a problem with the first AI smart enough, this doesn’t help much because a) there are presumably other AIs of similar capability, or soon will be, with somewhat different capability profiles and b) it won’t be long before there are more capable AIs and c) it’s hard to predict future capability profiles.
a. Ideally the techniques for reducing the propensity of AI agents to take risks due to overconfidence would be public, such that any frontier org would use them. The organizations deploying the AI don’t want that failure mode, the people asking the AIs to do things don’t want the failure mode, even the AIs themselves (to the extent that they can be modeled as having coherent preferences[2]) don’t want the failure mode. Someone might still do something dumb, but I expect making the tools to avoid that dumb mistake available and easy to use will reduce the chances of that particular dumb failure mode.
b. Unless civilization collapses due to a human or an AI making a catastrophic mistake before then
c. Sure, but I think it makes sense to invest nontrivial resources in the case of “what if the future is basically how you would expect if present trends continued with no surprises”. The exact unsurprising path you project in such a fashion isn’t very likely to pan out, but the plans you make and the tools and organizations you build might be able to be adapted when those surprises do occur.
Basically this entire thread was me disagreeing with
> Trying to address minor capability problems in hypothetical stupid AIs is irrelevant to x-risk.
because I think “stupid” scary AIs are in fact fairly likely, and it would be undignified for us to all die to a “stupid” scary AI accidentally ending the world.
[1] Concrete examples of the sorts of things I’m thinking of:
Build a more capable successor
Do significant biological engineering
Manage a globally-significant infrastructure project (e.g. “tile the Sahara with solar panels”)
[2] I think this extent is higher with current LLMs than commonly appreciated, though this is way out of scope for this conversation.
It depends on what you mean by scary. I agree that AIs capable enough to take over are pretty likely to be able to handle their own overconfidence. But the situation when those AIs are created might be substantially affected by the earlier AIs that weren’t capable of taking over.
As you sort of note, one risk factor in this kind of research is that the capabilities people might resolve that weakness in the course of their work, in which case your effort was wasted. But I don’t think that that consideration is overwhelmingly strong. So I think it’s totally reasonable to research weaknesses that might cause earlier AIs to not be as helpful as they could be for mitigating later risks. For example, I’m overall positive on research on making AIs better at conceptual research.
Overall, I think your comment is quite unreasonable and overly rude.
one risk factor in this kind of research is that the capabilities people might resolve that weakness in the course of their work, in which case your effort was wasted. But I don’t think that that consideration is overwhelmingly strong.
My argument was that there were several of “risk factors” that stack. I agree that each one isn’t overwhelmingly strong.
I prefer not to be rude. Are you sure it’s not just that I’m confidently wrong? If I was disagreeing in the same tone with e.g. Yampolskiy’s argument for high confidence AI doom, would this still come across as rude to you?
I do judge comments more harshly when they’re phrased confidently—your tone is effectively raising the stakes on your content being correct and worth engaging with.
If I agreed with your position, I’d probably have written something like:
I don’t think this is an important source of risk. I think that basically all the AI x-risk comes from AIs that are smart enough that they’d notice their own overconfidence (maybe after some small number of experiences being overconfident) and then work out how to correct for it.
There are other epistemic problems that I think might affect the smart AIs that pose x-risk, but I don’t think this is one of them.
In general, this seems to me like a minor capability problem that is very unlikely to affect dangerous AIs. I’m very skeptical that trying to address such problems is helpful for mitigating x-risk.
What changed? I think it’s only slightly more hedged. I personally like using “I think” everywhere for the reason I say here and the reason Ben says in response. To me, my version also more clearly describes the structures of my beliefs and how people might want to argue with me if they want to change my mind (e.g. by saying “basically all the AI x-risk comes from” instead of “The kind of intelligent agent that is scary”, I think I’m stating the claim in a way that you’d agree with, but that makes it slightly more obvious what I mean and how to dispute my claim—it’s a lot easier to argue about where x-risk comes from than whether something is “scary”).
I also think that the word “stupid” parses as harsh, even though you’re using it to describe something on the object level and it’s not directed at any humans. That feels like the kind of word you’d use if you were angry when writing your comment, and didn’t care about your interlocutors thinking you might be angry.
I think my comment reads as friendlier and less like I want the person I’m responding to to feel bad about themselves, or like I want onlookers to expect social punishment if they express opinions like that in the future. Commenting with my phrasing would cause me to feel less bad if it later turned out I was wrong, which communicates to the other person that I’m more open to discussing the topic.
(Tbc, sometimes I do want the person I’m responding to to feel bad about themselves, and I do want onlookers to expect social punishment if they behave like the person I was responding to; e.g. this is true in maybe half my interactions with Eliezer. Maybe that’s what you wanted here. But I think that would be a mistake in this case.)
I am confident about this, so I’m okay with you judging accordingly.
I appreciate your rewrite. I’ll treat it as something to aspire to, in future. I agree that it’s easier to engage with.
I was annoyed when writing. Angry is too strong a word for it though, it’s much more like “Someone is wrong on the internet!”. It’s a valuable fuel and I don’t want to give it up. I recognise that there are a lot of situations that call for hiding mild annoyance, and I’ll try to do it more habitually in future when it’s easy to do so.
There’s a background assumption that maybe I’m wrong to have. If I write a comment with a tone of annoyance, and you disagree with it, it would surprise me if that made you feel bad about yourself. I don’t always assume this, but I often assume it on Lesswrong because I’m among nerds for whom disagreement is normal.
So overall, I think my current guess is that you’re trying to hold me to standards that are unnecessarily high. It seems supererogatory rather than obligatory.
If you wrote a rude comment in response to me, I wouldn’t feel bad about myself, but I would feel annoyed at you. (I feel bad about myself when I think my comments were foolish in retrospect or when I think they were unnecessarily rude in retrospect; the rudeness of replies to me don’t really affect how I feel about myself.) Other people are more likely to be hurt by rude comments, I think.
I wouldn’t be surprised if Tim found your comment frustrating and it made him less likely to want to write things like this in future. I don’t super agree with Tim’s post, but I do think LW is better if it’s the kind of place where people like him write posts like that (and then get polite pushback).
I have other thoughts here but they’re not very important.
(fwiw I agree with Buck that the comment seemed unnecessarily rude and we should probably have less of rudeness on lesswrong, but I don’t feel deterred from posting.)
This assumes that [intelligent agents that can notice their own overconfidence] is the only/main source of x-risk, which seems false? I don’t think the first AI smart enough to cause catastrophe will need to be that smart.
This assumes that [intelligent agents that can notice their own overconfidence] is the only/main source of x-risk
Yeah, main. I thought this was widely agreed on, I’m still confused by how your shortform got upvoted. So maybe I’m missing a type of x-risk, but I’d appreciate the mechanism being explained more.
My current reasoning: It takes a lot of capability to be a danger to the whole world. The only pathway to destroying the world that seems plausible while being human-level-dumb is by building ASI. But ASI building still presumably requires lots of updating on evidence and learning from mistakes, and a large number of prioritisation decisions.
I know it’s not impossible to be systematically overconfident while succeeding at difficult tasks. But it’s more and more surprising the more subtasks it succeeds on, and the more systematically overconfident it is. Being systematically overconfident is a very specific kind of incompetence (and therefore a priori unlikely), and easily noticeable (and therefore likely to be human-corrected or self-corrected), and extremely easy to correct for (and therefore unlikely that the standard online learning process or verbalised reasoning didn’t generalise to this).
I don’t think the first AI smart enough to cause catastrophe will need to be that smart.
I think focusing on the “first AI smart enough” leads to a lot of low-EV research. If you solve a problem with the first AI smart enough, this doesn’t help much because a) there are presumably other AIs of similar capability, or soon will be, with somewhat different capability profiles and b) it won’t be long before there are more capable AIs and c) it’s hard to predict future capability profiles.
I think focusing on the “first AI smart enough” leads to a lot of low-EV research
Another post I want to write is that I think getting slightly-superhuman-level aligned AIs is probably robustly good/very high value. I don’t feel super confident in this, but hopefully you’ll see my fleshed-out thoughts on this soon.
I would say, it’s quite possible that it’s not that hard to make AIs that aren’t overconfident, but it just isn’t done anyways. Like, because we’re targeting near-human-level AIs built by actual AI companies that might operate very similarly to how they work now, it’s not that useful to reason about the “limits of intelligence.”
At no point in this discussion do I reference “limits of intelligence”. I’m not taking any limits, or even making reference to any kind of perfect reasoning. My x-risk threat models in general don’t involve that kind of mental move. I’m talking about near-human-level intelligence, and the reasoning works for AI that operates similarly to how they work now.
Sure, you haven’t made any explicit claims about “limits of intelligence,” but I guess I’m trying to counter this set of claims:
Overconfidence is a specific type of incompetence and therefore unlikely
I think we already see overconfidence in models. See davidad’s comment on how this could come from perverse RL credit assignment (h/t Jozdien). See also this martingale score paper. I think it’s reasonable to extrapolate from current models and say that future models will be overconfident by default (and I also expect this given that the model’s prior looks like human personas).
Overconfidence is easily noticeable
This seems probable with online learning but not necessarily always the case. It’s also possible that the model is not overconfident on easy to verify tasks but is overconfident on hard to verify tasks.
Overconfidence is extremely easy to correct for
I don’t think correcting for overconfidence would happen by default even if it is easy. It seems like “making Grok not say it worships Hitler” is much easier to correct than overconfidence, yet it is not done. We are not in a world where “obvious fixes” will be done by default.
This is sort of what I’m pointing to when I said “limits of intelligence.” Like yes, of course overconfidence is something that would get fixed eventually, but it’s not clear to me that it will be fixed until it’s too late (i.e., you can still build ASI with an overconfident AI).
I think we already see overconfidence in models. See davidad’s comment on how this could come from perverse RL credit assignment (h/t Jozdien). See also this martingale score paper. I think it’s reasonable to extrapolate from current models and say that future models will be overconfident by default
Cool, that makes sense. My disagreement with this comes from thinking that the current LLM paradigm is kinda missing online learning right now. When I add that in, it seems like a much less reasonable extrapolation to me.
This seems probable with online learning but not necessarily always the case. It’s also possible that the model is not overconfident on easy to verify tasks but is overconfident on hard to verify tasks.
I assumed that you weren’t talking about this kind of domain-specific overconfidence, since your original comment suggested forecasting as a benchmark. This seems not totally implausible to me, but at the same time data-efficient generalisation is a ~necessary skill of most kinds of research so it still seems odd to predict a particular kind of inability to generalise while also conditioning on being good at research.
Like yes, of course overconfidence is something that would get fixed eventually, but it’s not clear to me that it will be fixed until it’s too late
I’m primarily thinking about the AI correcting itself, like how you and I would in cases where it was worth the effort.
(i.e., you can still build ASI with an overconfident AI)
I think you’re saying this a tad too confidently. Overconfidence should slow down an AI in its research, cause it to invest too much in paths that won’t work out, over and over again. It’s possible it would still succeed, and it’s a matter of degree in how overconfident it is, but this could be an important blocker to being capable of effective research and development.
Yeah, main. I thought this was widely agreed on, I’m still confused by how your shortform got upvoted.
It got upvoted but not particularly agree-voted. I upvoted it, but didn’t agree-vote it. I thought it was a reasonable frame to think through, but overall disagreed (but didn’t feel like voting it into agreement-negatives, which maybe was a mistake).
Hmm, it still seems useful to see whether the LessWrong community agrees with something without having an opinion about it yourself? Maybe it can be some sort of mouseover thing? Not sure.
No. The kind of intelligent agent that is scary is the kind that would notice its own overconfidence—after some small number of experiences being overconfident—and then work out how to correct for it.
I mean, the main source of current x-risk is that humans are agents which are capable enough to do dangerous things (like making AI) but too overconfident to notice that doing so is a bad idea, no?
“Overconfident” gets thrown around a lot by people who just mean “incorrect”. Rarely do they mean actual systematic overconfidence. If everyone involved in building AI shifted their confidence down across the board, I’d be surprised if this changed their safety-related decisions very much. The mistakes they are making are more complicated, e.g. some people seem “underconfident” about how to model future highly capable AGI, and are therefore adopting a wait-and-see strategy. This isn’t real systematic underconfidence, it’s just a mistake (from my perspective). And maybe some are “overconfident” that early AGI will be helpful for solving future problems, but again this is just a mistake, not systemic overconfidence.
I think that generally when people say “overconfident” they have a broader class of irrational beliefs in mind than “overly narrow confidence intervals around their beliefs”, things like bias towards thinking well of yourself can be part of it too.
And maybe some are “overconfident” that early AGI will be helpful for solving future problems, but again this is just a mistake, not systemic overconfidence
OK, but whatever the exact pattern of irrationality is, it clearly exists simultaneously with humans being competent enough to possibly cause x-risk. It seems plausible that AIs might share similar (or novel!) patterns of irrationality that contribute to x-risk probability while being orthogonal to alignment per se.
One balancing factor is that overconfidence also makes AIs less capable, as they overconfidently embark on plans that are also disastrous to themselves. (This is part of the reason why I expect us to have more warning shots from misaligned AIs than traditional takeover scenarios imply—I expect the first misaligned AIs in such scenarios to have poorly calibrated predictions and fail partway through their takeover attempts.)
I read Tim’s comment and was like “oh wow good point” and then your comment and was like “oh shit, sign flip maybe.” Man, I could use a better way to think sanely about warning shots.
I could use a better way to think sanely about warning shots.
Yeah I should probably spend some time thinking about this as well. My tentative take is that “well I wouldn’t do this great safety intervention because it might avoid small AI catastrophes that kill a lot of people, but not all the people (and those catastrophes are actually good)” is suspicious reasoning. Like I’m so allergic to arguments of the form “allow bad thing to happen for the greater good.”
Also, I feel like we can just run lots of training ablations to see which methods are load bearing for how aligned models seem. For example, if we removed RLHF, and then the model just suddenly starts saying stuff about “My real goal is to hack into the Anthropic servers,”[1] then we should be pretty worried, and this doesn’t require people to actually die in a catastrophe.
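To gesture at what I mean (every function and stage name below is a hypothetical placeholder, not a description of any real training pipeline):

```python
FULL_PIPELINE = ["pretrain", "sft", "rlhf", "spec_training"]   # hypothetical stage names

def train_model(stages):
    """Placeholder: train (or fine-tune) a model using only the given stages."""
    raise NotImplementedError

def alignment_evals(model):
    """Placeholder: run your alignment eval suite, return eval_name -> score."""
    raise NotImplementedError

def run_ablations():
    """Ablate one post-training stage at a time and see which evals move."""
    baseline = alignment_evals(train_model(FULL_PIPELINE))
    report = {}
    for stage in FULL_PIPELINE[1:]:
        ablated = [s for s in FULL_PIPELINE if s != stage]
        scores = alignment_evals(train_model(ablated))
        # A stage is "load bearing" for an eval if removing it moves the score a lot.
        report[stage] = {name: baseline[name] - scores[name] for name in baseline}
    return report
```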
I spent a bit of time (like, 10 min) thinking through warning shots today.
I definitely do not think anyone should take any actions that specifically cause warning shots to happen (if you are trying to do something like that, you should be looking for “a scary demo”, not “a warning shot”. Scary demos can/should be demo’d ethically)
If you know of a concrete safety intervention that’d save lives, obviously do the safety intervention.
But, a lot of the questions here are less like “should I do this intervention?” and more like “should I invest years of my life researching a direction that helps found a new subfield that maybe will result in concrete useful things that save some lives locally, but that I also expect to paper over problems and cost more lives later?” (when, meanwhile, there are tons of other research directions you could explore)
...yes there is something sus here that I am still confused about, but, with the amount of cluelessness that necessarily involves I don’t think people have an obligation to go founding new research subfields if their current overall guess is “useful locally but harmful globally.”
I think if you go and try to suppress research into things that you think are moderately likely to save some lives a few years down the line but cost more lives later, then we’re back into ethically fraught territory (but, like, also, you shouldn’t suppress people saying “guys, this research line is maybe on net going to increase odds of everyone dying”).
I didn’t actually get to having a new crystallized take, that was all basically my background thoughts from earlier.
(Also, hopefully obviously: when you are deciding your research path, or arguing people should abandon one, you do have to actually do the work to make an informed argument for whether/how bad any of the effects are, ‘it’s plausible X might lead to a warning shot that helps’ or ‘it’s plausible Y might lead to helping on net with alignment subproblems’ or ‘Y might save a moderate number of lives’ are all things you need to unpack and actually reason through)
That’s fair but I guess I mostly don’t expect these AIs to be misaligned. I think overconfidence is something that you’ll have to fix in addition to fixing misalignment...
Seems pragmatically like a form of misalignment, propensity for dangerous behavior, including with consequences that are not immediately apparent. Should be easier than misalignment proper, because it’s centrally a capability issue, instrumentally convergent to fix for most purposes. Long tail makes it hard to get training signal in both cases, but at least in principle calibration is self-correcting, where values are not. Maintaining overconfidence is like maintaining a lie, all the data from the real world seeks to thwart this regime.
Humans would have a lot of influence on which dangerous projects early transformative AIs get to execute, and human overconfidence or misalignment won’t get fixed with further AI progress. So at some point AIs would get more cautious and prudent than humanity, with humans in charge insisting on more reckless plans than AIs would naturally endorse (this is orthogonal to misalignment on values).
As davidad suggests in that tweet, one way you might end up running into this is with RL that reinforces successful trajectories without great credit assignment, which could result in a model having very high confidence that its actions are always right. In practice this wasn’t obvious enough to be caught by various evals, and IMO could easily translate over into settings like high-stakes alignment research.
I remain fairly worried about the incentive structure for non-overconfident TAI (or nearby predecessors) that conclude that:
(1) They cannot safely continue scaling capabilities while remaining confident in the control/alignment of the system
(2) They correctly understand that “slow down or pause” are unlikely to be acceptable answers to labs
In the worst case, the model is successfully retrained to comply with going ahead anyway and is forced to be overconfident. In all other cases this also seems to have bad solutions.
I think this is totally fair. But the situation seems worse if your TAI is overconfident. I do think an important theory of victory here is “your correctly calibrated AI declares that it needs more time to figure out alignment and help coordinate/impose a slowdown.”
Why do you think forecasting data is limited? You can forecast all sorts of different events that currently don’t have existing forecasts made on them.
That’s fair. I guess I’m worried that forecasting only teaches a type of calibration that doesn’t necessarily generalize broadly? Much to think about...
For it to generalize broadly you could forecast events rather broadly. For each medical history of a patient you can forecast how it progresses. For each official government statistic you can forecast how it evolves. For each forward-looking statement in a company’s earnings call you can try to make it specific and forecast it. For each registered clinical trial you can forecast trial completion and outcomes based on trial completion.
xAI can forecast all sorts of different variables about its users. Will a given user post more or less on politics in the future? Will they move left or right politically?
When it comes to coding AIs you can predict all sorts of questions about how a code base will evolve in the future. You can forecast whether or not unit tests will fail after a given change.
Whenever you ask the AI to make decisions that have external consequences you can make it forecast the consequences.
(What I’m writing here has obvious implications for building capabilities, but I would expect people at the labs to be smart enough to have these thoughts on their own—if there’s anyone who thinks I shouldn’t write like this please tell me)
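As a concrete sketch of the unit-test example (the question format and helper names here are just illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ForecastQuestion:
    """One auto-generated calibration question."""
    text: str
    stated_prob: Optional[float] = None   # the model's probability when asked
    outcome: Optional[bool] = None        # filled in once reality resolves it

def questions_from_changes(changes):
    """Turn code changes into 'will the test suite pass?' questions.
    `changes` is any iterable of (change_id, description) pairs."""
    return [ForecastQuestion(f"Will all unit tests pass after change {cid}? ({desc})")
            for cid, desc in changes]

def brier(questions):
    """Mean squared error between stated probabilities and resolved outcomes."""
    resolved = [q for q in questions if q.stated_prob is not None and q.outcome is not None]
    return sum((q.stated_prob - float(q.outcome)) ** 2 for q in resolved) / len(resolved)

# Usage sketch: ask the model for probabilities, wait for CI to run, then score.
qs = questions_from_changes([(101, "refactor parser"), (102, "bump dependency")])
qs[0].stated_prob, qs[0].outcome = 0.9, False   # a confident call that resolved false
qs[1].stated_prob, qs[1].outcome = 0.7, True
print(brier(qs))  # 0.45: dominated by the overconfident miss
```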
The Case Against AI Control Research seems related. TL;DR: the mainline scenario is that the hallucination machine is overconfident about its own alignment solution, then it gets implemented without much checking, then doom.
This feels like a thing that shows up in evals before the AI takes catastrophic actions.
I guess the AI could be “aligned” in some sense but not corrigible / truthful, instrumentally hide its overconfidence, and then take catastrophic actions
Overconfidence from early transformative AIs is a neglected, tractable, and existential problem.
If early transformative AIs are overconfident, then they might build ASI/other dangerous technology or come up with new institutions that seem safe/good, but ends up being disastrous.
This problem seems fairly neglected and not addressed by many existing agendas (i.e., the AI doesn’t need to be intent-misaligned to be overconfident).[1]
Overconfidence also feels like a very “natural” trait for the AI to end up having relative to the pre-training prior, compared to something like a fully deceptive schemer.
My current favorite method to address overconfidence is training truth-seeking/scientist AIs. I think using forecasting as a benchmark seems reasonable (see e.g., FRI’s work here), but I don’t think we’ll have enough data to really train against it. Also I’m worried that “being good forecasters” doesn’t generalize to “being well calibrated about your own work.”
On some level this should not be too hard because pretraining should already teach the model to be well calibrated on a per-token level (see e.g., this SPAR poster). We’ll just have to elicit this more generally.
(I hope to flush this point out more in a full post sometime, but it felt concrete enough to worth quickly posting now. I am fairly confident in the core claims here.)
Edit: Meta note about reception of this shortform
This has generated a lot more discussion than I expected! When I wrote it up, I mostly felt like this is a good enough idea & I should put it on people’s radar. Right now there’s 20 agreement votes with a net agreement score of −2 (haha!) I think this means that this is a good topic to flush out more in a future post titled “I [still/no longer] think overconfident AIs are a big problem.” I feel like the commenters below has given me a lot of good feedback to chew on more.
More broadly though, lesswrong is one of the only places where anyone could post ideas like this and get high quality feedback and discussion on this topic. I’m very grateful for the lightcone team for giving us this platform and feel very vindicated for my donation.
Edit 2: Ok maybe the motte version of the statement is “We’re probably going to use early transformative AI to build ASI, and if ETAI doesn’t know that it doesn’t know what it’s doing (i.e., it’s overconfident in its ability to align ASI), we’re screwed.”
For example, you might not necessarily detect overconfidence in these AIs even with strong interpretability because the AI doesn’t “know” that it’s overconfident. I also don’t think there are obvious low/high stakes control methods that can be applied here.
No. The kind of intelligent agent that is scary is the kind that would notice its own overconfidence—after some small number of experiences being overconfident—and then work out how to correct for it.
There are more stable epistemic problems that are worth thinking about, but this definitely isn’t one of them.
Trying to address minor capability problems in hypothetical stupid AIs is irrelevant to x-risk.
There can be more than one kind of scary intelligent agent.
Yes, but what’s your point? Are you saying that highly capable (ASI building, institution replacing) but extremely epistemically inefficient agents are plausible? Without the ability to learn from mistakes?
Yes.
Wtithout optimally learning from mistakes. If you look at the most successful humans, they’re largely not the most-calibrated ones. This isn’t because being well-calibrated is actively harmful, or even because it’s not useful past a certain point, but just because it’s not the only useful thing and so spending your “points” elsewhere can yield better results.
I do expect the first such agents would be able to notice their overconfidence. I don’t particularly expect that they would be able to fix that overconfidence without having their other abilities regress such that the “fix” was net harmful to them.
If you think there’s a strong first-mover advantage you should care a lot about what the minimum viable scary system looks like, rather than what scary systems at the limit look like.
You’re making a much stronger claim than that and retreating to a Motte. Of course it’s not optimal. Not noticing very easy-to-correct mistakes is extremely, surprisingly sub-optimal on a very specific axis. This shouldn’t be plausible when we condition on an otherwise low likelihood of making mistakes.
The most natural explanation for this is that it’s mostly selection effects, combined with humans being bad at prediction in general. And I expect most examples you could come up with are more like domain-specific overconfidence rather than across-the-board overconfidence.
I agree calibration is less valuable than other measures of correctness. But there aren’t zero-sum “points” to be distributed here. Correcting for systematic overconfidence is basically free and doesn’t have tradeoffs. You just take whatever your confidence would be and adjust it down. It can be done on-the-fly, even easier if you have a scratchpad.
No, not when it comes to planning mitigations. See the last paragraph of my response to Tim.
I don’t think I am retreating to a motte. The wiki page for “epistemic efficiency” defines it as
On any class of questions within any particular domain, I do expect there’s an algorithm the agent could follow to achieve epistemic efficiency on that class of questions. For example, let’s say the agent in question wants to improve its calibration at the following question
And not just this question, but every question of the form “Given patient presents with symptom X, what is the probability that pharmacological test Y will have result Z”. I expect it could do something along the lines of
Gather a bunch of historical ground truth data
Test itself on said ground truth data to determine what systematic biases it has on that class of question, and on any particular subset of those questions it cares to identify
Build a corrective model, where it can feed in a question and an estimate and get out an estimate that corrects for all the biases it identified in step 2
On our current trajectory, I expect the minimal viable scary agent will fail to be epistemically efficient relative to humans in the following cases
Ground truth data is expensive for the agent to obtain, relative to the cost for humans to obtain it. This is particularly likely to happen in domains where the agent’s perception lags behind that of humans (e.g. some domain where visual-spatial reasoning is required to access the ground truth).
Domains where humans can identify subcategories of question that the agent fails to idenyify due to having worse-than-human sample efficiency (e.g. humans can throw a bunch of data into an animated heatmap and quite quickly identify areas that are “interesting”, and the ability of AI assistants to build high-quality informative high-bandwidth visualizations seems to be increasing much faster than the ability of AI agents to understand those visualizations)
Domains that the agent could have calibrated itself on, but where it didn’t actively choose to spend the resources to do so. I expect this will be true of most domains, but mostly noticed in a few specific domains where some question the agent has never put very much thought into in the past suddenly becomes very relevant to a lot of topics at once because the world changed.
I assume you’re talking about this one?
I think Tim is talking about addressing this problem in actual stupid AIs, not hypothetical ones. Our current systems (which would have been called AGI before we gerrymandered the definition to exclude them) do exhibit this failure mode, and this significantly reduces the quality of their risk assessments. As those systems are deployed more widely and grow more capable, the risk introduced by them being bad at risk assessment will increase. I don’t see any reason this dynamic won’t scale all the way up to existential risk.
Aside: I would be very interested to hear arguments as to why this dynamic won’t scale up to existential risk as agents become capable of taking actions that would lead to the end industrial civilization or the extinction of life on Earth. I expect such arguments would take the form “as AI agents get more capable, we should expect they will get better at reducing the probability of their actions having severe unintended consequences faster than their ability to do actions which could have severe unintended consequences will increase, because <your argument here>”. One particular concrete action I’m interested in is “ASI-building”—an AI agent that is both capable of building an ASI and confidently wrong that building an ASI would accomplish its goals seems really bad.
Anyway, my point is not that the minimal viable scary agent is the only kind of scary agent. My point is that
The minimal viable scary agent is in fact scary.
It doesn’t need to be superhuman at everything to be scary
It is worth investing more than zero resources into mitigating the risks we expect to see with the first scary agents
This is true even if we don’t expect those mitigation to scale all the way up to superhuman-at-literally-all-tasks ASI.
My read was:
JG: Without ability to learn from mistakes
FS: Without optimal learning from mistakes
But this was misdirection, we are arguing about how surprised we should be when a competent agent doesn’t learn a very simple lesson after making the mistake several times. Optimality is misdirection, the thing you’re defending is extreme sub-optimality and the thing I’m arguing for is human-level ability-to-correct-mistakes.
I agree that there are plausibly domains where a minimal viable scary agent won’t be epistemically efficient with respect to humans. I think you’re overconfident (lol) in drawing specific conclusions (i.e. that a specific simple mistake is likely) from this kind of reasoning about capable AIs, and that’s my main disagreement.
But engaging directly, all three of these seem not very relevant to the case of general overconfidence, because general overconfidence is noticeable and correctable from lots of types of experiment. A more plausible thing to expect is low quality predictions about low data domains, not general overconfidence across low and high data domains.
No, I meant this one:
I agree with all of these, so it feels a little like you’re engaging with an imagined version of me who is pretty silly.
Trying to rephrase my main point, because I think this disagreement must be at least partially a miscommunication:
Humans like you and I have the ability to learn from mistakes after making them several times. Across-the-board overconfidence is a mistake that we wouldn’t have much trouble correcting in ourselves, if it were important.
Domain-specific overconfidence on domains with little feedback is not what I’m talking about, because it didn’t appear to be what Tim was talking about. I’m also not talking about bad predictions in general.
I agree that this is the thing we’re arguing about. I do think there’s a reasonable chance that the first AIs which are capable of scary things[1] will have much worse sample efficiency than humans, and as such be much worse than humans at learning from their mistakes. Maybe 30%? Intervening on the propensity of AI agents to do dangerous things because they are overconfident in their model of why the dangerous thing is safe seems very high leverage in such worlds.
a. Ideally the techniques for reducing the propensity of AI agents to take risks due to overconfidence would be public, such that any frontier org would use them. The organizations deploying the AI don’t want that failure mode, the people asking the AIs to do things don’t want the failure mode, even the AIs themselves (to the extent that they can be modeled as having coherent preferences[2]) don’t want the failure mode. Someone might still do something dumb, but I expect making the tools to avoid that dumb mistake available and easy to use will reduce the chances of that particular dumb failure mode.
b. Unless civilization collapses due to a human or an AI making a catastrophic mistake before then
c. Sure, but I think it makes sense to invest nontrivial resources in the case of “what if the future is basically how you would expect if present trends continued with no surprises”. The exact unsurprising path you project in such a fashion isn’t very likely to pan out, but the plans you make and the tools and organizations you build might be able to be adapted when those surprises do occur.
Basically this entire thread was me disagreeing with
> Trying to address minor capability problems in hypothetical stupid AIs is irrelevant to x-risk.
because I think “stupid” scary AIs are in fact fairly likely, and it would be undignified for us to all die to a “stupid” scary AI accidentally ending the world.
Concrete examples of the sorts of things I’m thinking of:
Build a more capable successor
Do significant biological engineering
Manage a globally-significant infrastructure project (e.g. “tile the Sahara with solar panels”)
I think this extent is higher with current LLMs than commonly appreciated, though this is way out of scope for this conversation.
It depends on what you mean by scary. I agree that AIs capable enough to take over are pretty likely to be able to handle their own overconfidence. But the situation when those AIs are created might be substantially affected by the earlier AIs that weren’t capable of taking over.
As you sort of note, one risk factor in this kind of research is that the capabilities people might resolve that weakness in the course of their work, in which case your effort was wasted. But I don’t think that that consideration is overwhelmingly strong. So I think it’s totally reasonable to research weaknesses that might cause earlier AIs to not be as helpful as they could be for mitigating later risks. For example, I’m overall positive on research on making AIs better at conceptual research.
Overall, I think your comment is quite unreasonable and overly rude.
My argument was that there were several of “risk factors” that stack. I agree that each one isn’t overwhelmingly strong.
I prefer not to be rude. Are you sure it’s not just that I’m confidently wrong? If I was disagreeing in the same tone with e.g. Yampolskiy’s argument for high confidence AI doom, would this still come across as rude to you?
I do judge comments more harshly when they’re phrased confidently—your tone is effectively raising the stakes on your content being correct and worth engaging with.
If I agreed with your position, I’d probably have written something like:
What changed? I think it’s only slightly more hedged. I personally like using “I think” everywhere for the reason I say here and the reason Ben says in response. To me, my version also more clearly describes the structures of my beliefs and how people might want to argue with me if they want to change my mind (e.g. by saying “basically all the AI x-risk comes from” instead of “The kind of intelligent agent that is scary”, I think I’m stating the claim in a way that you’d agree with, but that makes it slightly more obvious what I mean and how to dispute my claim—it’s a lot easier to argue about where x-risk comes from than whether something is “scary”).
I also think that the word “stupid” parses as harsh, even though you’re using it to describe something on the object level and it’s not directed at any humans. That feels like the kind of word you’d use if you were angry when writing your comment, and didn’t care about your interlocutors thinking you might be angry.
I think my comment reads as friendlier and less like I want the person I’m responding to to feel bad about themselves, or like I want onlookers to expect social punishment if they express opinions like that in the future. Commenting with my phrasing would cause me to feel less bad if it later turned out I was wrong, which communicates to the other person that I’m more open to discussing the topic.
(Tbc, sometimes I do want the person I’m responding to to feel bad about themselves, and I do want onlookers to expect social punishment if they behave like the person I was responding to; e.g. this is true in maybe half my interactions with Eliezer. Maybe that’s what you wanted here. But I think that would be a mistake in this case.)
I am confident about this, so I’m okay with you judging accordingly.
I appreciate your rewrite. I’ll treat it as something to aspire to, in future. I agree that it’s easier to engage with.
I was annoyed when writing. Angry is too strong a word for it though, it’s much more like “Someone is wrong on the internet!”. It’s a valuable fuel and I don’t want to give it up. I recognise that there are a lot of situations that call for hiding mild annoyance, and I’ll try to do it more habitually in future when it’s easy to do so.
There’s a background assumption that maybe I’m wrong to have. If I write a comment with a tone of annoyance, and you disagree with it, it would surprise me if that made you feel bad about yourself. I don’t always assume this, but I often assume it on Lesswrong because I’m among nerds for whom disagreement is normal.
So overall, I think my current guess is that you’re trying to hold me to standards that are unnecessarily high. It seems supererogatory rather than obligatory.
If you wrote a rude comment in response to me, I wouldn’t feel bad about myself, but I would feel annoyed at you. (I feel bad about myself when I think my comments were foolish in retrospect or when I think they were unnecessarily rude in retrospect; the rudeness of replies to me don’t really affect how I feel about myself.) Other people are more likely to be hurt by rude comments, I think.
I wouldn’t be surprised if Tim found your comment frustrating and it made him less likely to want to write things like this in future. I don’t super agree with Tim’s post, but I do think LW is better if it’s the kind of place where people like him write posts like that (and then get polite pushback).
I have other thoughts here but they’re not very important.
(fwiw I agree with Buck that the comment seemed unnecessarily rude and we should probably have less of rudeness on lesswrong, but I don’t feel deterred from posting.)
This assumes that [intelligent agents that can notice their own overconfidence] is the only/main source of x-risk, which seems false? I don’t think the first AI smart enough to cause catastrophe will need to be that smart.
Yeah, main. I thought this was widely agreed on, I’m still confused by how your shortform got upvoted. So maybe I’m missing a type of x-risk, but I’d appreciate the mechanism being explained more.
My current reasoning: It takes a lot of capability to be a danger to the whole world. The only pathway to destroying the world that seems plausible while being human-level-dumb is by building ASI. But ASI building still presumably requires lots of updating on evidence and learning from mistakes, and a large number of prioritisation decisions.
I know it’s not impossible to be systematically overconfident while succeeding at difficult tasks. But it’s more and more surprising the more subtasks it succeeds on, and the more systematically overconfident it is. Being systematically overconfident is a very specific kind of incompetence (and therefore a priori unlikely), and easily noticeable (and therefore likely to be human-corrected or self-corrected), and extremely easy to correct for (and therefore unlikely that the standard online learning process or verbalised reasoning didn’t generalise to this).
I think focusing on the “first AI smart enough” leads to a lot of low-EV research. If you solve a problem with the first AI smart enough, this doesn’t help much because a) there are presumably other AIs of similar capability, or soon will be, with somewhat different capability profiles and b) it won’t be long before there are more capable AIs and c) it’s hard to predict future capability profiles.
Another post I want to write argues that getting slightly-superhuman-level aligned AIs is probably robustly good/very high value. I don’t feel super confident in this, but hopefully you’ll see my fleshed-out thoughts on it soon.
I would say, it’s quite possible that it’s not that hard to make AIs that aren’t overconfident, but it just isn’t done anyway. Like, because we’re targeting near-human-level AIs built by actual AI companies that might operate very similarly to how they work now, it’s not that useful to reason about the “limits of intelligence.”
At no point in this discussion do I reference “limits of intelligence”. I’m not taking any limits, or even making reference to any kind of perfect reasoning. My x-risk threat models in general don’t involve that kind of mental move. I’m talking about near-human-level intelligence, and the reasoning works for AIs that operate similarly to how they work now.
Sure, you haven’t made any explicit claims about “limits of intelligence,” but I guess I’m trying to counter this set of claims:
Overconfidence is a specific type of incompetence and therefore unlikely
I think we already see overconfidence in models. See davidad’s comment on how this could come from perverse RL credit assignment (h/t Jozdien). See also this martingale score paper. I think it’s reasonable to extrapolate from current models and say that future models will be overconfident by default (and I also expect this given that the model’s prior looks like human personas).
Overconfidence is easily noticeable
This seems probable with online learning, but not necessarily always the case. It’s also possible that the model is not overconfident on easy-to-verify tasks but is overconfident on hard-to-verify tasks.
Overconfidence is extremely easy to correct for
I don’t think correcting for overconfidence would happen by default even if it is easy. It seems like “making Grok not say it worships Hitler” is a much easier fix than correcting overconfidence, yet it is not done. We are not in a world where “obvious fixes” will be done by default.
This is sort of what I’m pointing to when I said “limits of intelligence.” Like yes, of course overconfidence is something that would get fixed eventually, but it’s not clear to me that it will be fixed before it’s too late (i.e., you can still build ASI with an overconfident AI).
Cool, that makes sense. My disagreement with this comes from thinking that the current LLM paradigm is kinda missing online learning. When I add that in, it seems like a much less reasonable extrapolation to me.
I assumed that you weren’t talking about this kind of domain-specific overconfidence, since your original comment suggested forecasting as a benchmark. This seems not totally implausible to me, but at the same time data-efficient generalisation is a ~necessary skill of most kinds of research so it still seems odd to predict a particular kind of inability to generalise while also conditioning on being good at research.
I’m primarily thinking about the AI correcting itself, like how you and I would in cases where it was worth the effort.
I think you’re saying this a tad too confidently. Overconfidence should slow down an AI in its research, cause it to invest too much in paths that won’t work out, over and over again. It’s possible it would still succeed, and it’s a matter of degree in how overconfident it is, but this could be an important blocker to being capable of effective research and development.
It got upvoted but not particularly agree-voted. I upvoted it, but didn’t agree-vote it. I thought it was a reasonable frame to think through, but overall disagreed (but didn’t feel like voting it into agreement-negatives, which maybe was a mistake).
Meta: seems like a good reason to have agreement vote counts hidden until after you’ve made your vote.
hmm it still seems useful to see whether the lesswrong community agrees with something without having an opinion about it yourself? Maybe it can be some sort of mouseover thing? Not sure
Yeah, a spoiler tag seems like a good solution
I mean, the main source of current x-risk is that humans are agents which are capable enough to do dangerous things (like making AI) but too overconfident to notice that doing so is a bad idea, no?
“Overconfident” gets thrown around a lot by people who just mean “incorrect”. Rarely do they mean actual systematic overconfidence. If everyone involved in building AI shifted their confidence down across the board, I’d be surprised if this changed their safety-related decisions very much. The mistakes they are making are more complicated, e.g. some people seem “underconfident” about how to model future highly capable AGI, and are therefore adopting a wait-and-see strategy. This isn’t real systematic underconfidence; it’s just a mistake (from my perspective). And maybe some are “overconfident” that early AGI will be helpful for solving future problems, but again this is just a mistake, not systematic overconfidence.
I think that generally when people say “overconfident” they have a broader class of irrational beliefs in mind than “overly narrow confidence intervals around their beliefs”, things like bias towards thinking well of yourself can be part of it too.
OK, but whatever the exact pattern of irrationality is, it clearly exists simultaneously with humans being competent enough to possibly cause x-risk. It seems plausible that AIs might share similar (or novel!) patterns of irrationality that contribute to x-risk probability while being orthogonal to alignment per se.
Yes, I agree with that.
The simple counterargument is that humans are scary and don’t notice their own overconfidence.
One balancing factor is that overconfidence also makes AIs less capable, as they overconfidently embark on plans that are also disastrous to themselves. (This is part of the reason why I expect us to have more warning shots from misaligned AIs than traditional takeover scenarios imply—I expect the first misaligned AIs in such scenarios to have poorly calibrated predictions and fail partway through their takeover attempts.)
I read Tim’s comment and was like “oh wow good point” and then your comment and was like “oh shit, sign flip maybe.” Man, I could use a better way to think sanely about warning shots.
Yeah I should probably spend some time thinking about this as well. My tentative take is that “well I wouldn’t do this great safety intervention because it might avoid small AI catastrophes that kill a lot of people, but not all the people (and those catastrophes are actually good)” is suspicious reasoning. Like I’m so allergic to arguments of the form “allow bad thing to happen for the greater good.”
Also, I feel like we can just run lots of training ablations to see which methods are load-bearing for how aligned models seem. For example, if we removed RLHF and the model just suddenly starts saying stuff about “My real goal is to hack into the Anthropic servers,”[1] then we should be pretty worried, and this doesn’t require people to actually die in a catastrophe.
This is a result from an earlier version of Anthropic’s Natural Emergent Misalignment from Reward Hacking paper, which for some reason didn’t make it into the final version.
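As a rough sketch of what such a training-ablation comparison could look like in code (a toy illustration only; `train_with_stages`, `count_concerning_completions`, and the stage names are hypothetical stand-ins for whatever pipeline and alignment eval a lab actually uses):

```python
# Hypothetical ablation harness: build model variants with one training stage
# removed, run the same alignment eval on each, and compare the results.
ABLATIONS = {
    "full_pipeline": ["pretrain", "sft", "rlhf"],
    "no_rlhf": ["pretrain", "sft"],
    "no_sft": ["pretrain", "rlhf"],
}

def run_ablation_study(train_with_stages, count_concerning_completions, eval_prompts):
    results = {}
    for name, stages in ABLATIONS.items():
        model = train_with_stages(stages)  # hypothetical training call
        results[name] = count_concerning_completions(model, eval_prompts)
    # A large jump in concerning completions when a stage is dropped suggests
    # that stage is load-bearing for how aligned the model appears.
    return results
```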
I spent a bit of time (like, 10 min) thinking through warning shots today.
I definitely do not think anyone should take any actions that specifically cause warning shots to happen (if you are trying to do something like that, you should be looking for “a scary demo”, not “a warning shot”. Scary demos can/should be demo’d ethically)
If you know of a concrete safety intervention that’d save lives, obviously do the safety intervention.
But, a lot of the questions here are less like “should I do this intervention?” and more like “should I invest years of my life researching into a direction that helps found a new subfield that maybe will result in concrete useful things that save some lives locally but also I expect to paper over problems and cost more lives later?” (when, meanwhile, there are tons of other research directions you could explore)
...yes there is something sus here that I am still confused about, but, with the amount of cluelessness that necessarily involves, I don’t think people have an obligation to go founding new research subfields if their current overall guess is “useful locally but harmful globally.”
I think if you go and try to suppress research into things that you think are moderately likely to save some lives a few years down the line but cost more lives later, then we’re back into ethically fraught territory (but, like, also, you shouldn’t suppress people saying “guys this research line is maybe on net going to increase odds of everyone dying”).
I didn’t actually get to having a new crystallized take, that was all basically my background thoughts from earlier.
(Also, hopefully obviously: when you are deciding your research path, or arguing people should abandon one, you do have to actually do the work to make an informed argument for whether/how bad any of the effects are, ‘it’s plausible X might lead to a warning shot that helps’ or ‘it’s plausible Y might lead to helping on net with alignment subproblems’ or ‘Y might save a moderate number of lives’ are all things you need to unpack and actually reason through)
This stuff is hard man
That’s fair but I guess I mostly don’t expect these AIs to be misaligned. I think overconfidence is something that you’ll have to fix in addition to fixing misalignment...
Seems pragmatically like a form of misalignment: a propensity for dangerous behavior, including behavior with consequences that are not immediately apparent. Should be easier than misalignment proper, because it’s centrally a capability issue, instrumentally convergent to fix for most purposes. The long tail makes it hard to get training signal in both cases, but at least in principle calibration is self-correcting, where values are not. Maintaining overconfidence is like maintaining a lie: all the data from the real world seeks to thwart this regime.
Humans would have a lot of influence on which dangerous projects early transformative AIs get to execute, and human overconfidence or misalignment won’t get fixed with further AI progress. So at some point AIs would get more cautious and prudent than humanity, with humans in charge insisting on more reckless plans than AIs would naturally endorse (this is orthogonal to misalignment on values).
Related: Gemini being delusionally confident that its misclicks are always due to system / human error rather than mistakes it may have made.
As davidad suggests in that tweet, one way you might end up running into this is with RL that reinforces successful trajectories without great credit assignment, which could result in a model having very high confidence that its actions are always right. In practice this wasn’t obvious enough to be caught by various evals, and IMO could easily translate over into settings like high-stakes alignment research.
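A minimal sketch of that failure mode, assuming a toy REINFORCE-style setup in which a single trajectory-level reward is broadcast to every token (the function and variable names here are illustrative, not from any particular training stack):

```python
import torch

def naive_policy_gradient_loss(token_logprobs: torch.Tensor, trajectory_reward: float) -> torch.Tensor:
    """Toy REINFORCE-style loss with no per-step credit assignment.

    token_logprobs: log-probabilities of every token the model emitted in one
    rollout, including confident-sounding assertions like "this will work".
    trajectory_reward: a single scalar reward for the whole rollout.
    """
    # The same scalar multiplies every token's log-probability, so confident
    # assertions inside successful rollouts get reinforced wholesale, even if
    # the success had nothing to do with the confidence expressed.
    return -(trajectory_reward * token_logprobs).sum()
```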
I remain fairly worried about the incentive structure for non-overconfident TAIs (or nearby predecessors) that conclude that:
(1) They cannot safely continue scaling capabilities while remaining confident in the control/alignment of the system
(2) They correctly understand that “slow down or pause” are unlikely to be acceptable answers to labs
In the worst case, the model is successfully retrained to comply with going ahead anyway and is forced to be overconfident. The other cases don’t seem to have good solutions either.
I think this is totally fair. But the situation seems worse if your TAI is overconfident. I do think an important theory of victory here is “your correctly calibrated AI declares that it needs more time to figure out alignment and help coordinate/impose a slowdown.”
Why do you think forecasting data is limited? You can forecast all sorts of different events that currently don’t have existing forecasts made on them.
That’s fair. I guess I’m worried that forecasting only teaches a type of calibration that doesn’t necessarily generalize broadly? Much to think about...
For it to generalize broadly, you could forecast events rather broadly. For each medical history of a patient you can forecast how it progresses. For each official government statistic you can forecast how it evolves. For each forward-looking statement in a company’s earnings call you can try to make it specific and forecast it. For each registered clinical trial you can forecast trial completion and outcomes based on trial completion.
xAI can forecast all sorts of different variables about its users. Will a given user post more or less on politics in the future? Will they move left or right politically?
When it comes to coding AIs, you can predict all sorts of questions about how a code base will evolve in the future. You can forecast whether or not unit tests will fail after a given change.
Whenever you ask the AI to make decisions that have external consequences you can make it forecast the consequences.
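As a minimal sketch of how resolved forecasts over events like these could become a calibration signal (the data structure and function names below are just illustrative): score each resolved forecast with a proper scoring rule, and check for systematic overconfidence by comparing stated probabilities against empirical hit rates.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ResolvedForecast:
    probability: float  # the model's stated probability that the event occurs
    outcome: bool       # whether it actually occurred

def brier_score(forecasts: List[ResolvedForecast]) -> float:
    """Mean squared error between stated probabilities and outcomes (lower is better)."""
    return sum((f.probability - float(f.outcome)) ** 2 for f in forecasts) / len(forecasts)

def overconfidence_gap(forecasts: List[ResolvedForecast], threshold: float = 0.9) -> float:
    """Among high-confidence forecasts, how much does stated confidence exceed
    the empirical hit rate? Positive values indicate systematic overconfidence."""
    confident = [f for f in forecasts if f.probability >= threshold]
    if not confident:
        return 0.0
    mean_stated = sum(f.probability for f in confident) / len(confident)
    hit_rate = sum(f.outcome for f in confident) / len(confident)
    return mean_stated - hit_rate
```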
(What I’m writing here has obvious implications for building capabilities, but I would expect people at the labs to be smart enough to have these thoughts on their own—if there’s anyone who thinks I shouldn’t write like this please tell me)
The Case Against AI Control Research seems related. TL;DR: the mainline scenario is that the hallucination machine is overconfident about its own alignment solution, then it gets implemented without much checking, then doom.
This feels like a thing that shows up in evals before the AI takes catastrophic actions.
I guess the AI could be “aligned” in some sense but not corrigible / truthful, instrumentally hide its overconfidence, and then take catastrophic actions