The safe simulation problem is to start with some dynamical physical process which would, if run long enough in some specified environment, produce some trustworthy information of great value, and to compute some adequate simulation of that process faster than the physical process could have run. In this context, the term “adequate” is value-laden: it means that whatever we would use the process for, using the simulation instead produces within epsilon of the expected value we could have gotten from using the real process.
In more concrete terms, for example, we might want to tell a Task AGI “upload this human and run them as a simulation”, and we don’t want some tiny systematic skew in how the Task AGI models serotonin to turn the human into a psychopath, which is a bad (value-destroying) simulation fault. Perfect simulation will be out of the question; the brain is almost certainly a chaotic system and hence we can’t hope to produce exactly the same result as a biological brain. The question, then, is what kind of not-exactly-the-same result the simulation is allowed to produce.
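One way to write down the “within epsilon” condition, as a rough sketch only: the symbols here are assumptions introduced for illustration. Let $D$ stand for the physical process, $S_D$ for its simulation, and $U$ for whatever value we place on the outcome of using one or the other.

```latex
\[
  \bigl|\, \mathbb{E}\left[ U(\text{using } S_D) \right]
        - \mathbb{E}\left[ U(\text{using } D) \right] \,\bigr|
  \;\le\; \epsilon
\]
```

Everything value-laden about “adequate” is hidden inside $U$; the inequality itself says nothing about which divergences matter.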
As with “low impact” hopefully being lower-complexity than “low bad impact”, we might hope to get an adequate simulation via some notion of faithful simulation, which rules out bumps in serotonin that turn the upload into a psychopath, while possibly also ruling out any number of other changes we wouldn’t see as important; with this notion of “faithfulness” still being permissive enough to allow the simulation to take place at a level above individual quarks. On whatever computing power is available—possibly nanocomputers, if the brain was scanned via molecular nanotechnology—the upload must be runnable fast enough to make the simulation task worthwhile.
Since the main use for the notion of “faithful simulation” currently appears to be identifying a safe plan for uploading one or more humans as a pivotal act, we might also consider this problem in conjunction with the special case of wanting to avoid mindcrime. In other words, we’d like a criterion of faithful simulation which the AGI can compute without it needing to observe millions of hypothetical simulated brains for ten seconds apiece, which could constitute creating millions of people and killing them ten seconds later. We’d much prefer, e.g., a criterion of faithful simulation of individual neurons and synapses between them up to the level of, say, two interacting cortical columns, such that we could be confident that in aggregate the faithful simulation of the neurons would correspond to the faithful simulation of whole human brains. This way the AGI would not need to think about or simulate whole brains in order to verify that an uploading procedure would produce a faithful simulation, and mindcrime could be avoided.
Note that the notion of a “functional property” of the brain—seeing the neurons as computing something important, and not wanting to disturb the computation—is still value-laden. It involves regarding the brain as a means to a computational end, and what we see as the important computational end is value-laden, given that chaos guarantees the input-output relation won’t be exactly the same. The brain can equally be seen as implicitly computing, say, the parity of the number of synapse activations; it’s just that we don’t see this functional property as a valuable one that we want to preserve.
To the extent that some notion of function might be invoked in a notion of faithful, permitted speedups, we should hope that rather than needing the AGI to understand the high-level functional properties of the brain and which details we thought were too important to simplify, it might be enough to understand a ‘functional’ model of individual neurons and synapses, with the resulting transform of the uploaded brain still allowing for a pivotal speedup and knowably-faithful simulation of the larger brain.
At the same time, strictly local measures of faithfulness seem problematic if they can conceal systematic larger divergences. We might think that any perturbation of a simulated neuron which has as little effect as adding one phonon is “within thermal uncertainty” and therefore unimportant, but if all of these perturbations are pointing in the same direction relative to some larger functional property, the difference might be very significant. The same applies if all simulated synapses released slightly more serotonin, rather than releasing slightly more or less serotonin in no particular systematic pattern.
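A small numerical sketch of why aligned perturbations are worse than unaligned ones. It assumes, purely for illustration, that the larger functional property responds to the plain sum of the per-unit perturbations; the unit count and perturbation size are hypothetical numbers, not estimates about real brains.

```python
import math
import random


def total_drift(n_units, eps, systematic, seed=0):
    """Sum n_units perturbations, each of magnitude eps.

    systematic=True:  every perturbation has the same sign.
    systematic=False: each sign is drawn independently at random.
    """
    rng = random.Random(seed)
    if systematic:
        return n_units * eps
    return sum(eps * rng.choice((-1.0, 1.0)) for _ in range(n_units))


n_units = 1_000_000  # hypothetical number of simulated units
eps = 1e-6           # hypothetical per-unit perturbation size

aligned = total_drift(n_units, eps, systematic=True)
unaligned = total_drift(n_units, eps, systematic=False)

# Standard deviation of a sum of n independent +/-eps terms.
typical_random = eps * math.sqrt(n_units)

print("aligned total:       ", aligned)
print("random-sign total:   ", unaligned)
print("typical random scale:", typical_random)
```

Under these assumptions the aligned total grows like n times eps (about 1.0 here), while the random-sign total stays on the order of eps times the square root of n (about 0.001 here). A test that only checks each unit against a local tolerance sees the same eps in both cases.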
One natural standard: it should be hard to distinguish an adequate model from the system-to-be-modeled, based on input/output behavior alone.
How hard? Ideally we’d have an “equally competent” modeler and distinguisher, and ask the modeler to try to fool the distinguisher. This is a popular approach to generative modeling, and something I’ve talked about in the context of AI control (as has Jessica).
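To make the modeler-versus-distinguisher standard concrete, here is a toy sketch. Everything in it is a stand-in assumed for illustration: the “real system” emits one Gaussian number per run, the “model” is the same thing with an adjustable skew, and the distinguisher is a simple threshold rule rather than an equally competent learner.

```python
import random


def real_system(rng):
    # Stand-in for the system-to-be-modeled: one scalar output per run.
    return rng.gauss(0.0, 1.0)


def make_model(bias):
    # Stand-in for a simulation whose outputs are shifted by `bias`.
    def model(rng):
        return rng.gauss(bias, 1.0)
    return model


def distinguisher_accuracy(model, n_train=5000, n_test=5000, seed=0):
    """Fit a threshold distinguisher, then score it on fresh samples."""
    rng = random.Random(seed)

    # "Training": estimate each source's mean, put a threshold halfway.
    real_mean = sum(real_system(rng) for _ in range(n_train)) / n_train
    model_mean = sum(model(rng) for _ in range(n_train)) / n_train
    threshold = (real_mean + model_mean) / 2.0
    model_is_higher = model_mean >= real_mean

    # "Testing": flip a fair coin for the source, then guess it.
    correct = 0
    for _ in range(n_test):
        from_model = rng.random() < 0.5
        x = model(rng) if from_model else real_system(rng)
        guess_model = (x >= threshold) == model_is_higher
        correct += guess_model == from_model
    return correct / n_test


acc_faithful = distinguisher_accuracy(make_model(0.0))
acc_skewed = distinguisher_accuracy(make_model(2.0))
print("no skew:   ", acc_faithful)
print("large skew:", acc_skewed)
```

Accuracy near 0.5 means this distinguisher cannot tell the model from the real system; accuracy well above 0.5 means it can. The sketch says nothing about whether a stronger distinguisher would find a difference this one misses, which is the open question in the discussion.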
This definition runs into many subtleties, but I think it is a natural starting point for a discussion. In particular, we are already way beyond concerns like “the brain is almost certainly a chaotic system and hence we can’t hope to produce exactly the same result as a biological brain.”
The key property we want from the distinguisher is that it can learn to detect relevant differences between the model and the real system. This seems like it might be the kind of problem that I would classify as “probably easy if the agent is powerful and the difference is really important” and you would classify as “way too hard to count on.”
You could also ask the model to output various intermediate results or to simulate requested measurements on the simulated brain, and give this extra information to the distinguisher. (Though I don’t think this would really help.)
Methodologically, I am trying to anticipate which problems are hard or easy, in order to understand what approaches may or may not work and what the key difficulties are. I wouldn’t describe this as “taking things for granted”; I think we are probably miscommunicating.
This is a big problem; I think it’s the more real version of “perfect simulation will be out of the question.” Note that this is only a concern for some processes (e.g. if the simulation output is one bit, then you don’t have this problem).
(Note that in practice generative adversarial models are extremely finicky to train, at least partly for this reason.)
I think the other big problem is the complementary one, that even an equally smart adversary can’t reliably distinguish a crappy simulation from a good simulation (where a dumb example is that no distinguisher can detect a steganographically encoded message even though that implies the simulation was poor).
Counting on things before you’ve found a solution to them isn’t very mindset, but I do consider this a promising approach. Definitely, the generative-adversarial approach in modern neural networks causes me to hope that this is the sort of thing that actually works in practice. So I might not be as pessimistic as you think? I still think in general that one does not go about taking things for granted, but the notion of faithful simulation seems like one that could prove to have a tractable core after hammering on it for a bit, and it also seems very possible that if you’re reasonably smart and you can’t detect any expected differences in the behavior of neural columns then the corresponding human simulation is faithful.
My current thoughts on possible failure modes:
“No differences you know about” might mix up the map and the territory in some obscurely fatal way that leads to the equivalent of the AI deliberately managing to ‘not know’ about inconvenient divergences.
If we use a limited AI and don’t let it run thousands of simulations of people that it can compare to thousands of brains in vats, then in practice its column-level tests won’t detect cumulative neural-level differences that lead to an 80% probability of schizophrenia.
The adversarial approach as written won’t work because it will turn out that it’s always possible for an equally smart adversary to tell the difference, especially for simulations that can be computed at a worthwhile speedup. Which means this test won’t meaningfully discriminate in the region of intuitively faithful vs. nonfaithful simulations. (This strikes me as the sort of issue that’s repairable, but perhaps not trivially so.)