MIRI’s “The Problem” hinges on diagnostic dilution
[adapted with significant technical improvements from https://clarifyingconsequences.substack.com/p/miris-ai-problem-hinges-on-equivocation, which I also wrote and will probably update to be more in line with this at some point]
I’m going to meet someone new tomorrow, and I’m wondering how many kids they have. I know their number of kids is a nonnegative 32 bit integer, and almost all such numbers are greater than 1 million. So I suppose they’re highly likely to have more than 1 million kids.
This is an instance of a fallacy I'm calling diagnostic dilution[1]. It is akin to a probabilistic form of the fallacy of the undistributed middle. The error goes like this: we want to evaluate some outcome A given some information X. We note that X implies Y (or nearly so), and so we decide to evaluate P(A|Y). But this is a mistake! Because X implies Y, we are strictly better off evaluating P(A|X)[2]. We've replaced a sharp condition with a vague one, hence the name diagnostic dilution.
In the opening paragraph we have A = "count > 1 million", X = "we're counting kids" and Y = "count is a nonnegative 32 bit integer". So P(A|X) is ≈ 0 while P(A|Y) is close to 1. Diagnostic dilution has led us far astray.
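To put toy numbers on that gap, here is a minimal Python sketch comparing the two conditional probabilities. The "kids prior" is entirely made up for illustration; only the rough shape matters.

```python
# Toy illustration of diagnostic dilution: compare P(A|Y) with P(A|X).
# A = "count > 1 million", X = "we're counting kids", Y = "count is a nonnegative int32".

# Conditioning only on Y, with a uniform prior over nonnegative 32-bit integers:
n_int32 = 2**31                        # values 0 .. 2**31 - 1
p_A_given_Y = (n_int32 - 1_000_001) / n_int32
print(f"P(A|Y) ~= {p_A_given_Y:.4f}")  # ~0.9995: almost all int32 values exceed 1 million

# Conditioning on X, with a rough made-up prior over number of kids:
kids_prior = {0: 0.45, 1: 0.18, 2: 0.22, 3: 0.10, 4: 0.04, 5: 0.01}
p_A_given_X = sum(p for k, p in kids_prior.items() if k > 1_000_000)
print(f"P(A|X) ~= {p_A_given_X:.4f}")  # 0.0000: nobody has a million kids
```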
Diagnostic dilution is always structurally invalid, but it misleads us specifically when the conclusion hinges on forgetting X. "Bob is a practising doctor, so Bob can prescribe medications, so Bob can probably prescribe Ozempic" is a fine inference because it goes through without forgetting that Bob is a doctor. But consider the same chain starting with "Bob is a vet": now the inference does not go through unless we forget Bob's job, and indeed the conclusion is false.
Diagnostic dilution is tricky because it looks a bit like a valid logical inference: X ⟹ Y and Y ⟹ A, so X ⟹ A. But if P(A|Y) is even a hair less than 1, then, as we saw, we can end up being very wrong.
Quick definition: Diagnostic dilution is a probabilistic reasoning error that involves replacing the condition X with a weaker condition Y and drawing a conclusion A that follows from conditioning on Y but doesn't follow from conditioning on X.
Now I'm going to consider a caricature of the argument presented in sections 2 and 4 of The Problem. Hopefully we won't have to haggle too much over the details of its representativeness:
X: We have an AI that autonomously builds football stadiums.
Y: This will exhibit many goal-directed behaviours.
By the instrumental convergence thesis, systems that pursue goals often seek to eliminate threatening agents (P(A|Y) is high, where A is "seeks to eliminate threatening agents"),
so a football stadium building AI is likely to seek to eliminate threatening agents (P(A|X) is high).
Now this argument is clearly structurally unsound, but does the conclusion actually depend on forgetting X? It does. Y follows from X by imagining a stadium builder, "observing its behaviour" and noting that it seemingly has to do things like
dealing with unreliable suppliers
negotiating complex planning issues with a broad set of parties with diverse interests
making tradeoffs about where to put resources that suggest it prefers a completed stadium to a really amazing statue out the front
etc.
However, we can apply the same method of imaginary behaviour observation to note that the stadium builder does not have to eliminate threatening agents. The machinery that got us from X to Y does not get us to a high probability of A from X. So this argument takes another path: invoking the instrumental convergence thesis to say that, ignoring the specific premise X, goal-directed agents in general have a high probability of A, a conclusion that depends on forgetting X[3].
Maybe there are technical reasons that make the conclusion correct. Even if this is so, it does not make the presented argument sound. Instead we have a different argument:
X: We have an AI that autonomously builds football stadiums.
Y: It will exhibit many goal-directed behaviours.
By the instrumental convergence thesis, systems that pursue goals often seek to eliminate threatening agents,
and for technical reasons this is likely to apply to X also,
so a football stadium building AI is likely to seek to eliminate threatening agents (A).
Section 2 of "The Problem" offers another similar argument in terms of "strategizing" instead of goal-directedness, but it commits the same error of diagnostic dilution: start with a specific kind of agent that does something complex and valuable, note that this has to involve strategizing, switch to considering things that strategize "in general"[4], and then via something like instrumental convergence conclude that they'll wipe out humans if they can.
Technical reasons why the conclusion might hold
There are two ways that I see to rescue the conclusion. One option: instead of imagining AIs that build football stadiums, we imagine AIs that build football stadiums while under attack (which could be, for example, cyber, kinetic or legal). It seems intuitive that such systems, if they are to be successful, have to in some sense seek to eliminate threats. Many people talk about AI as a tool to secure geopolitical dominance, and so "strong AI will be under attack in various ways" is not an especially far-fetched claim. I think this is a legitimate worry, but it doesn't seem to be central to MIRI's threat model at least. We need to worry about diagnostic dilution here too: an AI that successfully defends itself from adversarial cyber attacks need not defend itself from directions from legitimate authorities; even if we have adversarial pressure of a particular type, we are not licensed to substitute "agents acting under adversarial pressure in general".
Another option, which I think is closer to MIRI’s thinking, is some kind of agency generalisation hypothesis. The hypothesis goes something like this: if an AI is smart enough and has some markers of goal-directedness, then via generalization it picks up characteristics of the general class of goal-directed agents (as imagined by the instrumental convergence thesis). The minimal point I am making in this article is that if they are persuaded by a hypothesis like this, then it ought to appear in the original argument. But I have a less minimal point to make: I am also doubtful that this hypothesis is sound.
How do we get an AI football stadium builder? Here are two plausible proposals[5]:
Cooperative AI systems: we take an AI project manager which assembles an AI legal team, an AI construction team etc., all individually proven on many smaller and cheaper projects
Simulation: we construct a high fidelity simulation of all of the relevant systems in a city and have an AI learn to construct football stadiums by reinforcement learning
In the first case, we have a swarm of AI systems each with well-characterised behaviour and each operating fairly closely to its training/evaluation environment. Yes, the overall project can be significantly out of distribution for the whole system, but each subsystem is operating in a regime where its behaviour is well characterised (in fact, this property is probably what enables large ensembles to work at all). So once we know how all the components operate, there’s little room for additional “goal-directed properties of agents in general” to creep into the system[6].
In the second case, we can just look at the behaviours it exhibits in the simulation. Now there will be differences between the sim and the world, so this will not be a perfect indication of real world behaviour. It’s possible that in the transition from the simulation to the real world it applies behaviours learned (or generalized) in the sim in ways that are dangerous in the real world, but it is even more likely to exhibit learned behaviours that are not effective in the real world. If the simulation approach is to be effective, it probably has to have pretty high fidelity, in which case sim behaviours are likely to be pretty representative of the real world behaviour[7].
In both cases, there’s no “agency gap” that needs to be explained by broad generalization of agency. All of the goal-directed behaviours needed to accomplish the task are explained by narrow generalization from the training and architecture of the AI system that builds the stadium. So the goal-directedness generalization hypothesis does not help to explain the admittedly imaginary outcome of this admittedly imaginary experiment. Nevertheless, perhaps there is simply a tendency for smart agents to pick up features of “agency in general”, even though it is not strictly required for them to perform – something like the mechanism explored by Evan Hubinger in How Likely is Deceptive Alignment?
Without meaning to be too harsh on that piece – which I actually rather like – theories about machine learning on that scale of ambition do not have a high success rate. There is a reason why the field has converged on expecting empirical validation of even very modest theoretical results. If the case for doom hinges on that argument or a close relative, it is a serious error of omission not to mention it.
There is actually some empirical support for some kind of agency generalization thesis, though I think it is rather weak. In particular, we have Anthropic's agentic misalignment work. My rough understanding is that if you ask recent LLMs (like Claude 4) to "focus on achieving your goal above all else" and put them in a situation where they will either be shut off and fail to achieve their goal, or they have to blackmail or murder someone, they will reasonably often choose blackmail or murder. Also, relevant to the question here, this seems to apply more to more recent models, which are able to accomplish longer-horizon tasks. The obvious rejoinder is that you have to ask for this behaviour for it to appear, but there's still the weak appearance of a trend.
On the other hand, recent systems definitely seem motivated to create code that passes validation. They do it successfully fairly often, even in challenging situations. Furthermore, they've been noted to sometimes cheat on validation tests for code they found too hard to implement, and to write code in such a manner that it continues to run even if it contains errors. So this looks like motivated behaviour – they want to write successful code. But if that's what they really want, there are a bunch of other things they could do:
Convince users to give them easier specs
Quietly revise specs down
Dissuade users from changing specs halfway through
Dissuade users from imposing strict validation requirements
Convince users to ask for code even when it's not needed

But I've used these systems a lot and never encountered any of that. So even though there are strong indications of goal-directedness here, we do quite poorly if we interpret it as an overly broad tendency to pursue the apparent goal. If agency really does generalize broadly, it seems we're still far from seeing it happen, even in a restricted domain where the models are particularly agentic.
TL;DR
When you’re making an argument, don’t weaken your premises if you don’t have to
MIRI’s argument for goal-directedness & threat elimination does this
Need to consider additional technical details to see if their argument stands up
Strong AI + war is a worrying combination
I don’t find the case for goal-directedness from generalization compelling
[1] Please accept my sincere apologies if it is already named.
[2] Ask your favourite LLM why this is if it's not immediately obvious.
[3] Alternatively: just as we erred by evaluating P(count > 1 million | nonnegative int32) instead of P(count > 1 million | counting kids), here we err by evaluating P(threat elimination | goal-directed agent) instead of P(threat elimination | stadium builder).
[4] For the record, I think the notion of "the general class of things that pursue goals" is suspect, but I'm happy to grant it for the sake of argument here. But seriously, you need an actual prior and not a pretend one.
[5] I'm not putting these forward as safe ways to get an AI stadium builder – I'm putting them forward as ways you might get an AI stadium builder at all.
[6] In principle there's room for some kind of egregore to emerge, but I don't obviously see an account of why this is likely here, nor why arguments about goal-directed agents in general apply to it.
[7] I'm kind of expecting here a lot of responses along the lines of "a superintelligence could totally behave very differently out of sim if it wanted to", which is true, but a) I think we should try to get a grip on the somewhat-superhuman case before we go to the wildly superhuman case and b) whether it "wants to" is the entire point of contention.
TL;DR
You’ve taken a task that’s simple enough to be done by humans, so outer misalignment failures don’t matter, and you’ve just assumed away any inner misalignment failures, which again is only possible because the chosen task is easy, compared to really critical tasks we’d need an ASI to do.
Build a Football Stadium
You’ve made a bit of a reference-class error by choosing something which humans can already do perfectly well, and can easily verify.
As you pointed out, this task can be broken down into smaller tasks that predictable AIs can do in predictable ways. Alternatively it could (in theory) be done by building a simulator to do simple RL in, which requires us to understand the dynamics of the underlying system in great detail.
For genuinely difficult tasks, that humans can’t do, you don’t really have those options, your options look more like
Use a system made out of smaller components. Since you don’t know how to do the task, this will require allowing the AIs to spin up other AIs as needed, and it also might require AIs to build new components to do tasks you didn’t anticipate. So you just end up with a big multi-AI blob which you don’t understand.
Build an RL agent to do it, training in the real world (because the required simulator would be too difficult to build) and/or generalizing across tasks.
This runs into misalignment problems, as already stated
Outer Misalignment Problems
An agent which “Builds football stadiums” in the real world has to deal with certain adversarial conditions: malicious regulators; trespassers and squatters; scammy contractors; so it needs to have some level of “deal with threats” built in. Specifying exactly which threats to deal with (and how to deal with them) might be possible for something non-pivotal like “Build a football stadium” but for tasks outside the scope of human actions this is an extremely hard problem.
Imagine specifying the exact adversarial conditions that an AI should deal with vs should not deal with in order to do something really important like “Liberate the people of North Korea” or “Eliminate all diseases”.
Inner Misalignment Problems
An idealized AI which you have already specified to be just building football stadiums is all well and good, but that’s not how we get AIs. What you do is draw a random AI from a distribution specified by “Looks sufficiently like it just builds football stadiums, in training”.
There are lots of things which behave nicely in training, but not when deployed.
You can maybe make this work when you have a goal like “build football stadium” where a sufficiently high-quality simulation can be created, but how are you going to make this work for goals which are too complex for us to build a simulation of?
The class of “things which pursue goals”
This is actually a fair point, but I disagree that this is a reasonable reference class.
I think o-series models currently look, roughly, like a big pile of goal-pursuing-ness. (See the comment by nostalgebraist here.) You give them a prompt, they attempt to infer the corresponding RLVR task, and they pursue that goal, and this generalizes to tasks with no exact RLVR outcome.
I haven't assumed away inner misalignment; the unwanted emergence of instrumentally convergent goal seeking is a key premise of inner misalignment theory. I briefly mention Hubinger's work specifically too.
I’d say disease elimination is quantitatively but not qualitatively different to stadium building. “Eliminating threatening agents” still doesn’t make the list of necessities; there’s still a lot of exceptionally good but recognisable science, negotiation, project management, office politics to do. NK type issues I sort of call out as riskier.
(It’s also not obvious that disease elimination simulation is harder than football stadium simulation, but both seem kind of sci fi to me tbh)
You posited a stadium building AI and noted that "the stadium builder does not have to eliminate threatening agents". I agree. Such elimination is perhaps even counterproductive to its goal.
However, there are two important differences between the stadium building AI and the dangerous AIs described in "The Problem". The first is that you assume we correctly managed to instill the goal of stadium building into the AI. But in "The Problem", the authors specifically talk in section 3 (which you skipped in your summary) about how bad we are at instilling goals in AIs. So consider if, instead of instilling the goal of building stadiums legally, we accidentally instilled the goal of building stadiums regardless of legality. In such a case, assuming it could get away with it, the AI could threaten the mayor to give a permit or hire gangs to harass people on the relevant plot of land.
The second difference between your example and "The Problem" is horizon length. You gave an example of a goal with an end point, that could be achieved relatively soon. Imagine instead the AI wanted to run a stadium in a way that was clean and maximized income. All of a sudden taking over the global economy, if you can get away with it, sounds like a much better idea. The AI would need to make decisions about what is considered stadium income, but then you can funnel as much of the global economy as you want into or through the stadium by, say, making the tickets the standard currency, or forcing people to buy tickets, or switching the people for robots that obsessively use the stadium, or making the stadium insanely large, or a thousand other things I haven't thought of. More abstractly: subgoals instrumentally converge as the time horizon of the goal goes to infinity.
So basically—an agent with a long-term goal that isn’t exactly what you want and can run intellectual circles around all of humanity put together is dangerous.
I do appreciate everyone who takes the time to engage but this response is a bit frustrating. I don’t assume anywhere the AI has a goal of stadium building—I assume that it builds stadiums (which is a version of the assumption introduced by MIRI that the AI does complex long horizon stuff). The argument then goes: yes, an AI that builds stadiums is going to look like it does a bunch of goal directed stuff, but it doesn’t follow that it needs to have exactly the right kind of “stadium building goal” or else instrumental convergence related disaster strikes—that step is actually a reasoning error. Given that this step fails, we haven’t established why it matters that it’s hard to instil “the right kind of goals”, so I don’t engage with that part of the argument—I’m dealing with the logically prior step.
I know the argument is not easy, especially if you lack a background in probability, but it is precise and I worked quite hard on making it as clear as I could.
Please let me try again.
Given three events A, X, and Y, where X ⊆ Y,
P(Aᶜ|X)·P(X|Y) = (P(Aᶜ∩X)/P(X))·(P(X∩Y)/P(Y)) = P(Aᶜ∩X)/P(Y) ≤ P(Aᶜ∩Y)/P(Y) = P(Aᶜ|Y),
since X ⊆ Y ⇒ X∩Y = X and Aᶜ∩X ⊆ Aᶜ∩Y.
But this means
P(X|Y) ≤ P(Aᶜ|Y)/P(Aᶜ|X)
So if we accept that P(Aᶜ|Y) ≈ 0 and P(Aᶜ|X) ≫ 0 (that is, for there to be a significant difference between P(A|Y) and P(A|X,Y)), P(X|Y) must be very small.
So X must be an ultra specific subset of Y.
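A minimal numeric sketch of that bound, with toy events chosen purely for illustration (nothing here comes from the post or the comments; it just checks the algebra on a small finite space):

```python
# Toy check of P(X|Y) <= P(Ac|Y) / P(Ac|X) when X is a subset of Y.
# Sample space: integers 0..999 with a uniform measure (arbitrary choice).
omega = set(range(1000))
Y = set(range(0, 500))     # the "vague" event
X = set(range(0, 50))      # the sharper event, X ⊆ Y
A = set(range(25, 800))    # some outcome of interest
Ac = omega - A             # complement of A

def p(event, given):
    """P(event | given) under the uniform measure on omega."""
    return len(event & given) / len(given)

lhs = p(X, Y)                 # P(X|Y)
rhs = p(Ac, Y) / p(Ac, X)     # P(Ac|Y) / P(Ac|X)
assert X <= Y and lhs <= rhs + 1e-12
print(f"P(X|Y) = {lhs:.3f} <= P(Ac|Y)/P(Ac|X) = {rhs:.3f}")
```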
If I call the veterinary department and report the tiger in my back yard (X), and the personnel are sent to deal with a feline (Y), and naturally expect something nonthreatening (Aᶜ), they will be unpleasantly surprised (A). So losing important details is a bad idea, and this requires that tigers be a vanishingly small portion of the felines they meet on a day-to-day basis (P(X|Y) ≈ 0).
All this having been said, it seems like you accepted (for conversation’s sake at least) that doing long horizon stuff implies instrumental goals (X⊆Y), and that instrumental goals mostly imply doom and gloom (P(A|Y)≈1). So the underlying question is: are entities that do complex long horizon stuff unusual examples of entities that act instrumentally (such that P(X|Y) is small)? Or alternatively: when we lose information are we losing relevant information?
I think not. Entities that do long horizon stuff are the canonical example of entities that act instrumentally. I struggle to see what relevant information we could be losing by modeling a long horizon achiever as instrumentally motivated.
At this point, in reading your post I get hung up on the example. We are losing important information, I understood between the lines, since “the stadium builder does not have to eliminate threatening agents”. But either (1) yes, he does, obviously getting people who don’t want a stadium out of the way is a useful thing for it to do, and thus we didn’t actually lose the important information; or (2) this indeed isn’t a typical example of an instrumentally converging entity, nor is it a typical example of an entity that does complex long horizon stuff, of the type I’m worried about because of The Problem, because I’m worried about entities with longer horizons.
Is there a particular generalization of the stadium builder that makes it clearer what relevant information we lost?
There are still a few issues here. First, instrumental convergence is motivated by a particular (vague) measure on goal directed agents. Any behaviour that completes any particular long horizon task (or even any list of such behaviours) will have tiny measure here, so low P(X|Y) is easily satisfied.
Secondly, just because “long horizon task achievement” is typical of instrumental convergence doom agents, it does not follow that instrumental convergence doom is typical of long horizon task achievers.
The task I chose seems to be a sticking point, but you totally can pick a task you like better and try to run the arguments for it yourself.
You need to banish all the talk of "isn't it incentivised to…?"; it's not justified yet.
I was going with X being the event of any entity that is doing long horizon things, not a specific one. As such, small P(X|Y) is not so trivially satisfied. I agree this is vague, and if you could make it specific that would be a great paper.
Sure, typicality isn’t symmetrical—but the assumptions above (X is a subset of Y, P(A|Y)~=1) mean that I’m interested whether “long horizon task achievement” is typical of instrumental convergence doom agents not the other way around. In other words, I’m checking whether P(X|Y) is large or small.
Make money. Make lots of spiral shaped molecules (colloquially known as paper clips). Build stadiums where more is better. Explore the universe. Really any task that does not have an end condition (and isn’t “keep humans alive and well”) is an issue.
Regarding this last point, could you explain further? We are positing an entity that acts as though it has a purpose, right? It is eg moving the universe towards a state with more stadiums. Why not model it using “incentives”?
You do realise that the core messages of my post are a) that weakening your premise and forgetting the original can lead you astray, and b) the move from "performs useful long horizon action" to "can be reasoned about as if it is a stereotyped goal driven agent" is blocked, right? Of course if you reproduce the error and assume the conclusion you will find plenty to disagree with.
By way of clarification: there are two different data generating processes, both thought experiments. One proposes useful things we'd like advanced AI to do; this leads to things like the stadium builder. The other proposes "agents" of a certain type and leads to the instrumental convergence thesis. The upshot is that you can choose a set with high probability according to the first process that ends up having low probability according to the second.
You are not proposing tasks; you are proposing goals. A task here is like "suppose it successfully Xed", without commitment to whether it wants X or in what way it wants X.
This post is difficult for me to follow because
your use of types is inconsistent (the A, X, Y should be hypotheses I believe, but you’re saying stuff like this, which makes it sound like they’re half of hypotheses, or other objects---)
you say “I know their number of kids is a nonnegative 32 bit integer, and almost all such numbers are greater than 1 million. So I suppose they’re highly likely to have more than 1 million kids.” But this isn’t true; the premise you’re actually using here is that the number is a 32 bit integer with a uniform distribution. Just noting that an unknown element is in a set doesn’t itself imply anything about its probability distribution.
you’re not quoting the exact argument you’re criticizing
Maybe these are all just formalities that have no bearing on the validity of the critique, but at least for me they make it too difficult to figure out whether I agree with you or not to make it worth it, so I’m just giving up.
Hey thanks for making the effort.
Letters are propositions—things that could be true or false. Would it help if I wrote them out more explicitly rather than tagged them into sentences?
Yes I have to choose a distribution but if I’m forced to predict an unknown int32 with no additional information the uniform distribution seems like a reasonable choice. Ad-hoc, not explicitly defined probability distributions are common in this discussion.
I hear you about quoting, but their argument is kind of scattered so it was hard to quote without being too distracting and/or looking like I’m being excessively selective. Plus I really wanted to discuss a concrete complex project not an abstract one. So I decided just to foreground that it’s my version and let people decide if it’s a good enough rendition.
Yes, the edit definitely makes it better.
Well it’s clearly not a reasonable choice given that it results in a fallacy. I think if the actual error is using the wrong prior distribution then this should be reflected in what the diagnostic dilution fallacy is defined as. It’s not putting the element into the larger set because that isn’t false. I’d suggest a phrasing but I don’t have a good enough grasp on the concept to do this, and I’m still not even sure that the argument you’re criticizing is in fact an example of this fallacy if were phrased more precisely. (Also nitpicky, but “condition” in the definition is odd as well if they’re general hypotheses.)
The form of the fallacy is “forgetting” the first proposition X in favour of a weaker one Y. In the example, this means we put ourselves in the position of answering “you will be given a nonnegative int32 tomorrow, no further clues, predict what it is”. How would you choose a distribution in this situation?
Empirically, most uses of values of 32-bit integers (e.g. in programming in general) have magnitudes very much smaller than 2^30. A significant proportion of their empirical values are between 0 and 10. So my prior would be very much non-uniform.
I would expect something like a power law, possibly on the order of P(X >= x) ~ x^-0.2 or so except with some smaller chance of negative values as well, maybe 10-20% of that of positive ones (which the statement “nonnegative” eliminates), and a spike at zero of about 20-40% of total probability. It should also have some smaller spikes at powers of two and ten, and also one less than those.
This would still be a bad estimate for the number of children, once I found out that you were actually talking about children, but not by nearly as much.
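A rough numeric rendering of a prior like this (the 30% spike at zero, the 0.2 exponent and the int32 truncation are just one arbitrary instantiation of the sketch above; the smaller spikes at powers of two and ten are ignored):

```python
# One arbitrary instantiation of the prior sketched above: a 30% spike at zero,
# and otherwise P(count >= v) proportional to v**-0.2 on 1 <= v <= 2**31 - 1.
p_zero = 0.30
v_max = 2**31 - 1

def tail(v):
    """Unnormalised survival function of the power-law part."""
    return v ** -0.2

def p_at_least(v):
    """P(count >= v) under this toy prior, for v >= 1 (renormalised for truncation)."""
    return (1 - p_zero) * (tail(v) - tail(v_max)) / (1 - tail(v_max))

print(f"P(count >= 1)        ~= {p_at_least(1):.3f}")          # 0.70 by construction
print(f"P(count > 1 million) ~= {p_at_least(1_000_001):.3f}")  # roughly 0.035
# Versus ~0.9995 under the uniform int32 prior: still wrong for kids, but far less wrong.
```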
Now you’re sort of asking me to do the work for you, but I did get interested enough to start thinking about it, so here’s my more invested take.
So first of all, I don’t see how this is a form of the fallacy of the undistributed middle. The article you linked to says that we’re taking A⟹C and B⟹C and conclude A⟹B. I don’t see how your fallacy is doing (a probabilistic version of) that. Your fallacy is taking X"⟹"Y as given (with "⟹" meaning “makes more likely”), and Y"⟹"A and concluding X"⟹"A
Second
I think we’ve substituted a vague condition for a sharp one, not vice-versa? The 32 bit integer seems a lot more vague than the number being about kids?
Third, your leading example isn’t an example of this fallacy, and I think you only got away with pretending it’s one by being vague about the distribution. Because if we tried to fix it, it would have to be like this
A: the number is probably > 100000
X: the number is a typical prior for having kids
Y: the number is a roughly uniformly distributed 32 bit integer
And X"⟹"Y is not actually true here. Whereas in the example you’re criticizing
A: the AI will seek to eliminate threatening agents
X: the AI builds football stadiums
Y: the AI has goal-directed behavior
here X"⟹"Y does seem to be true.
(And also I believe the fallacy isn't even a fallacy, because X"⟹"Y and Y"⟹"A together do in fact imply X"⟹"A, at least if both "⟹" are sufficiently strong?)
So my conclusion here is that the argument actually just doesn’t work, or I still just don’t get what you’re asserting,[1] but the example you make does not seem to have the same problems as the example you’re criticizing, and neither of them seems to have the same structure as the example of the fallacy you’re linking. (For transparency, I initially weak-downvoted because the post seemed confusing and I wasn’t sure if it’s valid, then removed the downvote because you improved the presentation, now strong-downvoted because the central argument seems just broken to me now.)
like maybe the fallacy isn’t about propositions implying each other but instead about something more specific to an element being in a set, but at this point the point is just not argued clearly.
You've just substituted a different proposition and then claimed that the implication doesn't hold because it doesn't hold for your alternative proposition. "We're counting kids" absolutely implies "the count can be represented by a nonnegative int32". If I want to show that an argument is unsound I am allowed to choose the propositions that demonstrate its unsoundness.
The X⟹Y implication is valid in your formulation, but then Y doesn’t imply anything because it says nothing about the distribution. I’m saying that if you change Y to actually support your Y⟹A conclusion, then X⟹Y fails. Either way the entire argument doesn’t seem to work.
Sorry but this is nonsense. JBlack’s comment shows the argument works fine even if you take a lot of trouble to construct P(count|Y) to give a better answer.
But this isn’t even particularly important, because for your objection to stand, it must be impossible to find any situation where P(A|Y) would give you a silly answer, which is completely false.
if I’m using deductive reasoning, your point seems sound, and so I’d like to see them update their post to strengthen the weak link.
if I’m using abductive reasoning, the argument they seem to be trying to make seems likely to have a sound version that reaches the conclusions they claim it does. I expect the missing link will have something to do with competitive pressure against versions of the model that can’t defend against a source of failures, from versions that can.
Given that alignment is theoretically solvable (probably) and not currently solved, almost any argument about alignment failure is going to have an "and the programmers didn't have a giant breakthrough at the last minute" assumption.
Yes. I expect that, before smart AI does competent harmful actions (as opposed to flailing randomly, which can also do some damage), there will exist, somewhere within the AI, a pretty detailed simulation of what is going to happen.
Reasons humans might not read the simulation and shut it down.
A previous competent harmful action intended to prevent this.
The sheer number of possible actions the AI considers.
Default difficulty of a human understanding the simulation.
Let's consider an optimistic case. You have found a magic computer and have programmed in the laws of quantum field theory. You have added various features so you can put a virtual camera and microphone at any point in the simulation. Let's say you have a full VR setup. There is still a lot of room for all sorts of subtle indirect bad effects to slip under the radar. Because the world is a big place and you can't watch all of it.
Also, you risk any prediction of a future infohazard becoming a current day infohazard.
In the other extreme, it's a total black box. Some utterly inscrutable computation, perhaps learned from training data. Well, in the worst case, the whole AI, from data in to action out, is one big homomorphically encrypted black box.
I think you meant to say something different with this paragraph, or I am confused:
This does not seem to be an example of “diagnostic dilution”
X: Bob is a doctor (vet). Y: Bob can prescribe medicine. A: Bob can prescribe Ozempic.
"Bob is a doctor so Bob can prescribe medicine. Bob can prescribe medicine so Bob can probably prescribe Ozempic." (Unsound, but turns out to be fine because the conclusion is about the same as "Bob is a doctor who can prescribe medicine, so Bob can probably prescribe Ozempic", which is the sound version of the argument.)
“Bob is a vet so Bob can prescribe medicine. Bob can prescribe medicine so Bob can probably prescribe Ozempic.” Now we see why the argument is unsound—it can go wrong given a different premise.
Ok it does seem like an example then. Thank you for spelling it out.