Suppose that I convinced you “if you didn’t know much chemistry, you would expect this AI to yield good outcomes.” I think you should be pretty happy. It may be that the AI would predictably cause a chemistry-related disaster in a way that would be obvious to you if you knew chemistry, but overall I think you should expect not to have a safety problem.
This feels like an artifact of a deficient definition; I should never end up with a lemma like “if you didn’t know much chemistry, you’d expect this AI to yield good outcomes” rather than being able to directly say what we want to say.
That said, I do see some appeal in proving things like “I expect running this AI to be good,” and if we are ever going to prove such statements they are probably going to need to be from some impoverished perspective (since it’s too hard to bring all of the facts about our actual epistemic state into such a proof), so I don’t think it’s totally insane.
If we had a system that is ascription universal from some impoverished perspective, you may or may not be OK. I’m not really worrying about it; I expect this definition to change before the point where I literally end up with a system that is ascription universal from some impoverished perspective, and this definition seems good enough to guide next research steps.
In order to satisfy this definition, 𝔼¹ needs to know every particular fact 𝔼 knows. It would be nice to have a definition that got at the heart of the matter while relaxing this requirement.
I don’t think your condition gets around this requirement. Suppose that Y is a bit that 𝔼 knows and 𝔼¹ does not, that Z[0] and Z[1] are two hard-to-estimate quantities (that 𝔼¹ and 𝔼² know but 𝔼 does not), and that X = Z[Y].
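For concreteness (my restatement of the construction, assuming Y ∈ {0, 1}), X is just the Z selected by the bit:

```latex
\[
  X \;=\; Z[Y] \;=\; (1 - Y)\,Z[0] \;+\; Y\,Z[1].
\]
```

So whatever 𝔼¹ knows about Z[0] and Z[1], matching 𝔼’s information about X seems to require recovering the particular bit Y that 𝔼 knows, which is presumably why the relaxed condition doesn’t escape the requirement.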
Perhaps you say “these cells are too simple, they can’t learn/reflect/etc.” Well, chances are humans will have the same issue once the computational burden gets large enough.
I don’t think the situation is symmetrical here.
Humans have easy-to-extract preferences over possible “wiser versions of ourselves.” That is, you can give me a menu of slightly modified versions of myself, and I can try to figure out which of those best captures my real values (or what kind of process should be used for picking which of those best captures my real values, etc.). Those wiser versions of ourselves can in turn have preferences over even wiser/smarter versions of themselves, and we can hope that the process might go on ad infinitum.
It may be that the process with humans eventually hits a ceiling—we prefer that we become smarter and wiser in some obvious ways, but eventually we’ve picked the low-hanging fruit and we are at a loss for thinking about how to change without compromising our values. Or it may be that we are wrong about our preferences, and that iterating this deliberative process goes somewhere crazy.
But those are pretty fundamentally different from the situation with E. coli, where we have no way to even get the process started. In particular, the difficulty of running the process with E. coli doesn’t give us much information about whether the process with humans would top out or go off the rails, once we know that humans are able to get the process started.
What does “expect” mean here? Probability > .5?
We don’t think that C has any expected edge in identifying an attack that A does not—if C is up to something, and A can’t tell, then we think that the “being up to something” provides no evidence about whether something bad is about to happen. This is much stronger than a fixed probability of failure.
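One way to cash out “provides no evidence” (my formalization, not a quote of the definition) is as a conditional-independence claim from the relevant perspective:

```latex
\[
  P(\text{bad outcome} \mid C \text{ is up to something},\ A \text{ can't tell})
  \;=\;
  P(\text{bad outcome} \mid A \text{ can't tell}).
\]
```

A “probability > .5” reading would only bound quantities like the left-hand side, which is the weaker “fixed probability of failure” condition mentioned above.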
Also, can you walk through an example of how C might “trick” A[C] and how subjective dominance implies that we don’t expect C to succeed?
Simple caricatured examples:
C might propose a design for a computer that has a backdoor that an attacker can use to take over the computer. But if this backdoor will actually be effective, then A[C] will know about it.
C might propose a design that exploits a predictable flaw in A’s reasoning (e.g. overlooking consequences of a certain kind, being overly optimistic about some kinds of activities, incorrectly equating two importantly different quantities...). But then A[C] will know about it, and so if A[C] actually reasons in that way then (in some sense) it is endorsed.
I don’t know how to make sense of this. If an impoverished perspective expects C not to be able to “trick” A, it seems kind of obvious that’s not good enough for me to think the same?
If an impoverished perspective expects C not to be able to trick A, then either:
In fact C won’t be able to trick A.
C will trick A, but the perspective is too weak to tell.
I think I don’t quite understand what you are saying here. What exactly is obvious?
From a suitably advanced perspective it’s obvious that C will be able to trick A sometimes—it will just get “epistemically lucky” and make an assumption that A regards as silly but turns out to be right.
I’m aiming for things like:
max-HCH with budget kn dominating max-HCH with budget n for some constant k>1.
HCH with advice and budget kn dominating HCH with no advice and budget n.
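Schematically, and assuming the “conditioning on the dominated system’s beliefs adds nothing” form of epistemic dominance (the exact definition may differ in detail), the first aim reads: for every quantity X,

```latex
\[
  \mathbb{E}\big[\, X \mid B_{\text{max-HCH}(kn)},\, B_{\text{max-HCH}(n)} \,\big]
  \;=\;
  \mathbb{E}\big[\, X \mid B_{\text{max-HCH}(kn)} \,\big],
\]
```

where B denotes the beliefs ascribed to each system and 𝔼 is the evaluating perspective; the second aim is the same statement with HCH-with-advice and budget-n HCH-without-advice in place of the two systems.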
Does an algorithm that adds two numbers have a belief about the rules of addition? Does a GIF to JPEG converter have a belief about which image format is “better”?
I’m not assuming any fact of the matter about what beliefs a system has. I’m quantifying over all “reasonable” ways of ascribing beliefs. So the only question is which ascription procedures are reasonable.
I think the most natural definition is to allow an ascription procedure to ascribe arbitrary fixed beliefs. That is, we can say that an addition algorithm has beliefs about the rules of addition, or about what kinds of operations will please God, or about what kinds of triples of numbers are aesthetically appealing, or whatever you like.
Universality requires dominating the beliefs produced by any reasonable ascription procedure, and adding particular arbitrary beliefs doesn’t make an ascription procedure harder to dominate (so it doesn’t really matter whether we count the procedures in the last paragraph as reasonable). The only thing that makes it hard to dominate C is the fact that C can do actual work that causes its beliefs to be accurate.
their inner workings are not immediately obvious
OK, consider the theorem prover that randomly searches over proofs then?
C is an arbitrary computation; to be universal, the humans must be better informed than *any* simple-enough computation C.
The examples in the post are a chess-playing algorithm, image classification, and (more fleshed out) deduction, physics modeling, and the SDP solver.
The deduction case is probably the simplest; our system is manipulating a bunch of explicitly-represented facts according to the normal rules of logic, and we ascribe beliefs in the obvious way (i.e. if it deduces X, we say it believes X).
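As a toy illustration (mine, not from the post) of that ascription procedure: a forward-chaining deducer whose ascribed beliefs are exactly the facts it has derived.

```python
# Toy sketch of the "obvious" ascription procedure for a deduction system:
# the system forward-chains over explicitly represented facts, and we
# ascribe to it a belief in exactly the statements it has deduced.

def forward_chain(facts, rules):
    """Derive everything reachable from `facts` using Horn-clause `rules`.

    `facts` is a set of atoms; `rules` is a list of (premises, conclusion)
    pairs, where `premises` is a collection of atoms.
    """
    believed = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in believed and all(p in believed for p in premises):
                believed.add(conclusion)   # the system deduces `conclusion`...
                changed = True
    return believed                        # ...so we say it believes it


if __name__ == "__main__":
    facts = {"rain"}
    rules = [({"rain"}, "wet_ground"), ({"wet_ground"}, "slippery")]
    # Ascribed beliefs: {'rain', 'wet_ground', 'slippery'}
    print(forward_chain(facts, rules))
```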
I think all three of the estimates mentioned there correspond to marginal probabilities (rather than probabilities conditioned on “no governance interventions”). So those estimates already account for scenarios in which governance interventions save the world. Therefore, it seems we should not strongly update against the necessity of governance interventions due to those estimates being optimistic.
I normally give ~50% as my probability we’d be fine without any kind of coordination.
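For reference, the marginal-versus-conditional distinction being drawn here is just the law of total probability (notation mine):

```latex
\[
  P(\text{fine})
  \;=\;
  P(\text{fine} \mid \text{coordination})\,P(\text{coordination})
  \;+\;
  P(\text{fine} \mid \text{no coordination})\,P(\text{no coordination}),
\]
```

and the ~50% above is offered as the conditional term P(fine | no coordination), not as the marginal P(fine).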
Why use IRL instead of behavioral cloning, where you mimic the actions that the demonstrator took?
IRL can also produce different actions at equilibrium (given finite capacity); it’s not merely an inductive bias.
E.g. suppose the human does X half the time and Y half the time, and the agent can predict the details of X but not Y. Behavioral cloning then does X half the time, and half the time does some crazy thing where it’s trying to predict Y but can’t. IRL will just learn that it can get OK reward by outputting X (since otherwise the human wouldn’t do it) and will avoid trying to do things it can’t predict.
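A toy numeric version of that example (all numbers invented purely for illustration; this is not an implementation of either method):

```python
# Toy numbers for the X/Y example above: the agent can reproduce X
# reliably, but its attempts at Y mostly fail.

p_reproduce_x = 0.99   # agent's chance of successfully doing X when it tries
p_reproduce_y = 0.10   # agent's chance of successfully doing Y when it tries
value_success = 1.0    # payoff of a successful imitation of the human
                       # (a botched attempt is worth 0)

# Behavioral cloning mimics the human's 50/50 mixture over X and Y,
# including the half it can't actually execute.
bc_value = 0.5 * p_reproduce_x * value_success \
         + 0.5 * p_reproduce_y * value_success

# An IRL-style agent infers that X has decent reward (the human chose it)
# and sticks to the behavior it can actually carry out.
irl_value = p_reproduce_x * value_success

print(f"behavioral cloning: {bc_value:.2f}")  # ~0.55
print(f"IRL-style agent:    {irl_value:.2f}")  # ~0.99
```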
I agree there is a real sense in which AGZ is “better-grounded” (and more likely to be stable) than iterated amplification in general. (This was some of the motivation for the experiments here.)
Uncertainty about future aid introduces a cost, and certainly recipients will be better off if aid is predictable.
But if there were no externalities from production, then I think the presence of variable aid would still always leave you better off on average than no aid. Worst case, you need to invest in net-producing capacity anyway (in case aid disappears), which you can finance by charging higher prices if free nets disappear.
The main problem with that is that if aid disappears, there will be a wealth transfer from net consumers to net producers. Given risk aversion, that stochastic transfer is bad for everyone. So you’d either want to insure against aid variability, or purchase an option on nets in advance. If you can’t do either of those things but nets can be stored, then you can literally manufacture the nets in advance and sell them to people who are concerned that net prices may go up, and that’s still a Pareto improvement. If you can’t do that either, then you could lose, but realistically I think rational expectations is the weaker link here :)
I don’t know anything about the particular case of net production. I think that the general argument against aid is similar to the typical argument for protectionism, which I think is something like:
Local production creates local infrastructure, know-how, human capital, etc.
Over the long run this benefits the region much more than it benefits the producers or consumers themselves.
So the state has reason to subsidize local production / tax imports.
If you have usual econ 101 models (including rational expectations), then variability itself doesn’t cause any trouble; the only problem comes from these positive externalities. These externalities could be pretty big; it’s plausible to me that they are much larger than the direct benefits to producers and consumers.