I think you’re misreading Eliezer’s article; even with major advances in neural networks, we don’t have general intelligence, which was the standard he was holding them to in 2007, not “state of the art on most practical AI applications.” He also stresses “people outside the field”--to a machine learning specialist, the suggestion “use neural networks” is not nearly enough to go on. “What kind?” they might ask, exasperated; and even if you suggested “well, why not make it as deep as the actual human cortex?” they might point out the ways in which backpropagation fails to work at that scale, without those defects having an obvious remedy. In context—the Seeing With Fresh Eyes sequence—it seems pretty clear that it’s about someone thinking this is a brilliant new idea, as opposed to a thing that lots of people already think.
Where’s your impression coming from? [I do agree that Eliezer has been critical of neural networks elsewhere, but I think generally in precise and narrow ways, as opposed to broadly underestimating them.]
The mockery of “neural networks” as the standard “revolutionary AI thing” reads differently today
I think the point being made there is different. For example, the contemporary question is, “how do we improve deep reinforcement learning?” to which the standard answer is “we make it model-based!” (or, I say near-equivalently, “we make it hierarchical!“, since the hierarchy is a broad approach to model embedding). But people don’t know how to do model-based reinforcement learning in a way that works, and the first paper to suggest that was in 1991. If there’s a person whose entire insight is that it needs to be model-based, it makes sense to mock them if they think they’re being bold or original; if there’s a person whose insight is that the right shape of model is XYZ, then they are actually making a bold claim because it could turn out to be wrong, and they might even be original. And this remains true even if 5-10 years from now everyone knows how to make deep RL model-based.
The point is not that the nonconformists were wrong—the revolutionary AI thing was indeed in the class of neural networks—the point is that someone is mistaken if they think that knowing which class the market / culture thinks is “revolutionary” gives them any actual advantage. You might bias towards working on neural network approaches, but so is everyone else; you’re just chasing a fad rather than holding onto a secret, even if the fad turns out to be correct. A secret looks like believing a thing about how to make neural networks work that other people don’t believe, and that thing turning out to be right.
what’s the new feature?
Drafts can now be shared with other users.
Logic is simply a part of physics.
Logic is prior to physics. It could be the case that physics is different; it could not be the case that logic is different. (Put another way, logic occupies a higher level of the Tegmark multiverse, kind of; one can hypothesize a Tegmark V where logic is different. We don’t have a formal model of what “counterlogical reasoning” looks like yet, that is, reasoning about what it would be like if logic were different, whereas we have solid formal models of reasoning about what it would look like if physics were different (either in terms of dynamics or boundary conditions).)
You are saying that CDT doesn’t understand common causes.
Of an agent’s decisions, because the CDT procedure views actions as interventions, which uproot the relevant node (using the terminology of this paper), that is, delete all of its parents besides the intervention. Observations are distinct from interventions; on observing the weather online, CDT can infer whether the grass is wet or dry. On editing the webpage to say that it is raining, CDT does not infer that the grass is wet—which is correct!
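To make the observation/intervention distinction concrete, here is a minimal Python sketch of the webpage/rain/grass example (the probabilities and names are mine, purely for illustration):

```python
# Minimal sketch of observation vs. intervention for the rain/webpage/grass example.
# All numbers are made up for illustration.

P_RAIN = 0.3                      # P(rain)
P_PAGE_SAYS_RAIN = {True: 0.95,   # P(page says "rain" | rain)
                    False: 0.05}  # P(page says "rain" | no rain)
P_WET = {True: 0.9,               # P(grass wet | rain)
         False: 0.1}              # P(grass wet | no rain)

def p_wet_given_observe_page_says_rain():
    """Condition on the page: the page is evidence about rain, so update on rain first."""
    joint_rain = P_PAGE_SAYS_RAIN[True] * P_RAIN
    joint_dry = P_PAGE_SAYS_RAIN[False] * (1 - P_RAIN)
    p_rain = joint_rain / (joint_rain + joint_dry)
    return p_rain * P_WET[True] + (1 - p_rain) * P_WET[False]

def p_wet_given_edit_page_to_say_rain():
    """Intervene on the page: do(page = "rain") deletes the page's parent (the weather),
    so the edit tells us nothing about rain and the prior on rain is unchanged."""
    return P_RAIN * P_WET[True] + (1 - P_RAIN) * P_WET[False]

print(p_wet_given_observe_page_says_rain())  # ~0.81: observing the page raises P(wet)
print(p_wet_given_edit_page_to_say_rain())   # 0.34: editing the page doesn't
```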
I literally just gave arguments about why it’s not the correct action. You repeating this and not countering any of the arguments I brought up doesn’t really help.
Suppose you are building a robot that will face this challenge, and programming what it does in the case where it sees that the box is full. You consider the performance of a one-boxer. It will see the $1M 99% of the time, and take only it, and see only the $1k 1% of the time, and take that. Total expected reward: $990,010.
A two-boxer will see the $1M 1% of the time, and take both, and see only the $1k 99% of the time, and take that. Total expected reward: $11,000.
Since you like money, you program the robot to one-box.
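For concreteness, the arithmetic behind those numbers, as a quick Python sketch (the 99% figure and dollar amounts are the ones assumed above):

```python
# Expected payoffs for a robot programmed to one-box vs. two-box,
# facing a predictor that is right 99% of the time.
ACCURACY = 0.99
BIG, SMALL = 1_000_000, 1_000

# One-boxer: predicted correctly 99% of the time (box full, takes $1M),
# mispredicted 1% of the time (box empty, takes $1k).
ev_one_box = ACCURACY * BIG + (1 - ACCURACY) * SMALL

# Two-boxer: mispredicted 1% of the time (box full, takes $1,001,000),
# predicted correctly 99% of the time (box empty, takes $1k).
ev_two_box = (1 - ACCURACY) * (BIG + SMALL) + ACCURACY * SMALL

print(round(ev_one_box), round(ev_two_box))  # 990010 11000
```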
To check, do you think it’s correct to pay the driver in Parfit’s Hitchhiker once you reach town?
If we have an ordering over logical sentences, such that we can look at two sentences and determine at most one of (A is simpler than B), (B is simpler than A), then it seems natural to privilege the counterfactual that keeps the simpler term constant (and likely that this ordering is such that you never have to choose between counterfactuals at the same level of simplicity).
This doesn’t fully solve the problem—now I have a concept of thickness that’s predicated on an ordering, and the ordering is (in some sense) arbitrary for the reasons noted elsewhere (I could define B = A xor C as the ground term, which makes A = B xor C now a composite term). But it seems (to me) like the important thing is being able to build a model that doesn’t allow cyclical behavior at all. Afterwards, one can check to see whether or not the ordering matters (and if so, try to figure out the criteria that make for a good ordering), or view it as arbitrary in approximately the way that axiom sets are arbitrary.
Yep, Transparent Newcomb’s and Parfit’s Hitchhiker are the same problem.
I have a physical definition of causation too. This is why I think CDT 1-boxes. Our universe is causal.
Our universe is also logical, as I’ll explain in a bit. But, importantly, if you think CDT one-boxes then your ‘physical definition of causation’ is different from the definition of causation held by people who think CDT two-boxes.
Consider the twin prisoner’s dilemma. You and your psychological twin are put into separate rooms, have to choose whether to cooperate or defect, etc.; you might believe in a logical connection between your reasoning and your twin’s reasoning (such that if your reasoning leads you to defect, theirs will as well, and if your reasoning leads you to cooperate, theirs will as well), but you can’t believe in a physical connection between your reasoning and your twin’s reasoning (that this particular voltage in your brain is in the causal history of that particular voltage in their brain). And so if you only reason based on the physical effects that you have on the universe, you end up defecting, because as much as you would like to signal your willingness to cooperate to your twin (and get a guarantee from them), you don’t have the mechanism to do so.
If one has a logical definition of causation (as well as a physical one), then you reason as follows: two calculators, even if physically separated, will get the same answer if they run the same computation. What my decision is doing is working out how a particular computation terminates, so I can think as if I’m choosing both my action and my psychological twin’s action, much like one calculator can expect other calculators will reach the same mathematical result. So reasoning “If I cooperate, then my twin will also cooperate” is valid for the same reasons that “if my calculator says 3*17 is 51, then other calculators will say the same.” [Note that this is actually a different sort of validity than “if I place a ball in a bowl, it will stay there”--if I placed the ball on a hill instead, the ball would roll, but if my calculator miscalculated 3*17 as 37, that wouldn’t change math—and that different sort of validity is why CDT doesn’t respect it.]
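A toy version of that contrast, with made-up payoff numbers (this is just my sketch of the two evaluation rules, nothing canonical):

```python
# Payoffs (to me) in the twin prisoner's dilemma; numbers are illustrative.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def cdt_value(my_action, p_twin_cooperates):
    """Physical-effects-only reasoning: my choice has no effect on the twin,
    so the twin's distribution is held fixed whatever I pick."""
    return (p_twin_cooperates * PAYOFF[(my_action, "C")]
            + (1 - p_twin_cooperates) * PAYOFF[(my_action, "D")])

def logical_value(my_action):
    """Logical connection: the twin runs the same computation, so whatever
    my reasoning outputs, theirs outputs too."""
    return PAYOFF[(my_action, my_action)]

for p in (0.1, 0.5, 0.9):
    assert cdt_value("D", p) > cdt_value("C", p)  # defecting dominates for any fixed belief
print(logical_value("C"), logical_value("D"))     # 3 1 -> cooperating wins
```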
If Omega is a perfect predictor, and you are presented the two boxes, well, the only possible answer is yes.
This is not how CDT reasons about its possible actions; it assumes that it can sever all connections to parent nodes whenever it makes a choice. So even in the 100% world, CDT, using the causal graphs you provided, would two-box. This is actually a feature. [Thus, the way Omega maintains perfect predictive ability is that you never see the full box.]
The correct action in transparent Newcomb’s is to one-box when you see the money, even if Omega is only 99% accurate. [Depending on the formulation, it can also be right to one-box when you don’t see the money, but it’s cleaner to assume Omega’s prediction only depends on what you do when you see the money.] Notice that your decision theory does better when it closes its eyes, which seems like a weird feature for a decision theory to have.
It’s possibly a confusion from its name, but I don’t understand why CDT doesn’t 1-box. If it doesn’t, I don’t understand what’s hard about having a decision theory that is causal and 1-boxes.
I was long a proponent of the “CDT one-boxes” position, but was eventually convinced that CDT is a term of art owned by particular philosophers, that they have a physical definition of causation, and that according to that definition two-boxing is sensible. Basically, you imagine a universe where you take decision A and a universe where you take decision B, making no other changes, run the clock forward, and go with the universe that looks better. Omega’s perfect prediction is a time loop where running the clock forward changes the past, and that breaks this decision mechanism.
For example, consider this scenario:
Omega, who can perfectly predict human psychology, determines whether or not Alice will 1-box or 2-box, and sets up her scenario with labeled boxes; it also does the same for you, and sets up your scenario. But due to a scheduling mishap, you get the set of boxes prepared for Alice, and Alice gets the set of boxes prepared for you. In this case, should you one-box or two-box? [Omega made their predictions based off of how you would perform in the standard Newcomb’s problem, not this one, so attempting to fill the box for Alice will not in fact affect whether or not Alice has a full box.]
In this scenario, there doesn’t seem to be a connection between your behavior and Omega’s prediction, and so either the million dollars is there or it isn’t. Taking both boxes is then the obviously sensible thing to do.
Through the mechanism that CDT uses to evaluate futures, this scenario looks the same as standard Newcomb.
Another scenario to consider (transparent Newcomb’s):
Omega, who can perfectly predict your actions, presents you a set of clear boxes. One always contains $1,000, and the other contains $1,000,000 iff you take only that box when both boxes are full. On seeing both boxes full, do you take both boxes?
Here it’s also easy to generate some sympathy for CDT. You can see the money. Abstaining from taking the $1,000 increases the probability of an event that you can already condition on being true. But the FDT solution is to only take the $1M, because that’s the only way to see the $1M in the first place. [An intuition pump here is to imagine the situation playing out in Omega’s imagination, and then how that affects reality: because of perfect prediction, you have to behave in reality the way that you want to behave in Omega’s imagination.]
The assignment of probabilities to actions doesn’t influence the final decision here. We just need to assign probabilities to everything. They could be anything, and the decision would come out the same.
Aren’t there meaningful constraints here? If I think it’s equally likely that I’m in L-world and R-world and that this is independent of my action, then I have the constraint that P(Left, L-world)=P(Left, R-world) and another constraint that P(Right, L-world)=P(Right, R-world), and if I haven’t decided yet then I have a constraint that P>0 (since at my present state of knowledge I could take any of the actions). But beyond that, positive linear scalings are irrelevant.
This doesn’t (yet) seem like an argument that alignment is likely difficult. Why should intelligence be shaped like a pyramid? Even if it is, how does alignment depend on the shape of intelligence? Intuitively, if intelligence is shaped like a pyramid, then it’s just really hard to get intelligence, and so we don’t build a superintelligent AI.
Agreed that the rest of the argument is undeveloped in the OP.
First is the argument that animal intelligence is approximately pyramidal in its construction, with neurons serving roles at varying levels of abstraction, and (importantly) higher layers being expressed in terms of neurons at lower layers, in basically the way that neurons in an artificial neural network work.
Alignment can (sort of) be viewed as a correspondence between intelligences. One might analogize this to comparing two programs and trying to figure out if they behave similarly. If the programs are neural networks, we can’t just look at the last layer and see if the parameter weights line up; we have to look at all the parameters, and do some complicated math to see if they happen to be instantiating the same (or sufficiently similar) functions in different ways. For other types of programs, checking that they’re the same is much easier; for example, consider the problem of showing that two formulations of a linear programming problem are equivalent.
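As a rough illustration of why the neural-network case is hard: identical functions can have parameter matrices that look nothing alike, so you end up probing behavior instead of comparing weights. This is a toy numpy sketch with invented weights; a real functional-equivalence check is far more involved than sampling inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, b1, W2, b2):
    """A tiny one-hidden-layer network."""
    return np.tanh(x @ W1 + b1) @ W2 + b2

# Network A, with arbitrary weights.
W1 = rng.normal(size=(4, 8)); b1 = rng.normal(size=8)
W2 = rng.normal(size=(8, 2)); b2 = rng.normal(size=2)

# Network B: the same function, but with the hidden units permuted, so the raw
# parameter matrices don't line up entry by entry.
perm = rng.permutation(8)
W1p, b1p, W2p = W1[:, perm], b1[perm], W2[perm, :]

xs = rng.normal(size=(1000, 4))
print(np.allclose(W1, W1p))                                              # False (for a nontrivial permutation)
print(np.allclose(mlp(xs, W1, b1, W2, b2), mlp(xs, W1p, b1p, W2p, b2)))  # True: same function
```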
I think “really hard” is an overstatement here. It looks like evolution built lizards then mammals then humans by gradually adding on layers, and it seems similarly possible that we could build a very intelligent system out of hooking together lots of subsystems that perform their roles ‘well enough’ but without the sort of meta-level systems that ensure the whole system does what we want it to do. Often, people have an intuition that either the system will fail to do anything at all, or it will do basically what we want, which I think is not true.
Check out Gates’s April 2018 speech on the subject. Main takeaway: bed nets started becoming less effective in 2016, and they’re looking at different solutions, including gene drives to wipe out mosquitoes, a solution unlikely to require as much maintenance as bed nets.
Like causality, intention & agency seems to me intensely tied up with an incomplete and coarse-grained model of the world.
This seems right to me; there’s probably a deep connection between multi-level world models and causality / choices / counterfactuals.
We cannot ask a rock to consider hypothetical scenarios. Neither can we ask an ant to do so.
This seems unclear to me. If I reduce intelligence to circuitry, it looks like the rock is the null circuit that does no information processing, the ant is a simple circuit that does some simple processing, and a human is a very complex circuit that does very complex processing. The rock has no sensors to vary, but the ant does, and thus we could investigate a meaningful counterfactual universe where the ant would behave differently were it presented with different stimuli.
Is the important thing here that the circuitry instantiate some consideration of counterfactual universes in the factual universe? I don’t know enough about ant biology to know whether or not they can ‘imagine’ things in the right sense, but if we consider the simplest circuit that I view as having some measure of ‘intelligence’ or ‘optimization power’ or whatever, the thermostat, it’s clear that the thermostat isn’t doing this sort of counterfactual reasoning (it simply detects whether it’s in state A or B and activates an actuator accordingly).
If so, this looks like trying to ground out ‘what are counterfactuals?’ in terms of the psychology of reasoning: it feels to me like I could have chosen to get something to drink or keep typing, and the interesting thing is where that feeling comes from (and what role it serves and so on). Maybe another way to think of this is something like “what are hypotheticals?”: when I consider a theorem, it seems like the theorem could be true or false, and the process of building out those internal worlds until one collapses is potentially quite different from the standard presentation of a world of Bayesian updating. Similarly, when I consider my behavior, it seems like I could take many actions, and then eventually some actions happen. Even if I never take action A (and never would have, for various low-level deterministic reasons), it’s still part of my hypothetical space, as considered in the real universe. Here, ‘actions I could take’ has some real instantiation, as ‘hypotheticals I’m considering implicitly or explicitly’, complete with my confusions about those actions (“oh, turns out that action was ‘choke on water’ instead of ‘drink water’. Oops.”), as opposed to some Platonic set of possible actions, and the thermostat that isn’t considering hypotheticals is rightly viewed as having ‘no actions’ even tho it’s more reactive than a rock.
This seems promising, but collides with one of the major obstacles I have in thinking about embedded agency; it seems like the descriptive problem of “how am I doing hypothetical reasoning?” is vaguely detached from the prescriptive question of “how should I be doing hypothetical reasoning?” or the idealized question of “what are counterfactuals?”. It’s not obvious that we have an idealized view of ‘set of possible actions’ to approximate, and if we build up from my present reasoning processes, it seems likely that there will be some sort of ontological shift corresponding to an upgrade that might break lots of important guarantees. That said, this may be the best we have to work with.
How does this experiment distinguish between the rock and the ant? It seems to me like we can establish, over many trials, that no matter where we start the rock out in a bowl, it acts as though it wants to follow a particular trajectory towards the bottom of the bowl. [We could falsify the hypothesis that it wants to reach the bottom of the bowl as quickly as possible, as it doesn’t just rush there and then stop, but surely we aren’t penalizing the rock for having a complicated goal.]
It seems like you’re trying to rule out the rock’s agency through the size of its action set:
At each round we assign more and more agency to R. Rather than a binary ‘Yes, R has agency’ or ‘No, R has no agency’, we imagine a continuum going from a rock, which has no possible actions, to an ant, which might pass some of the tests but not all, to humans and beyond.
This procedure doesn’t establish what actions an agent has access to, just what actions they do in fact take, and whether or not those actions line up with our model of rational choice for particular goals. I don’t see how this distinguishes between a rock that could either float or roll down the bowl and decides to roll down the bowl, and a rock that can only roll down the bowl.
[It seems to me that we actually determine such things through our understanding of physics and machinery; we look at a rock and don’t see any actuators that would allow it to float, and thus infer that the rock didn’t have that option. Or we have some ‘baseline matter’ that we consider as ‘passive’ in that it isn’t doing anything that reflects agency or action, even though it does include potentially dramatic changes just through the normal update laws of physics, and then we have other matter that we consider as ‘active’ because it behaves quite differently from the ‘passive’ matter, using the same normal update laws of physics. But this gets quite troublesome if you look at it too hard.]
The core feature here seems to be that the agent has some ability to refer to itself, and that this localization differs between instantiations. Alice optimizes for dollars in her wallet, Bob optimizes for dollars in his wallet, and so they end up fighting over dollars despite being clones, because the cloning procedure doesn’t result in arrows pointing at the same wallet.
It seems sensible to me to refer to this as the ‘exact same source code,’ but it’s not obvious to me how you would create these sort of conflicts without that sort of different resolution of pointers, and so it’s not clear how far this argument can be extended.
The point you raised, that “expected number of aliens is high vs. substantial probability of no aliens” is an explanation of why people were confused.
Right, I think it’s important to separate out the “argument for X” and the “dissolving confusions around X” as the two have different purposes.
I’m making this comment because if I’m right it means that we only need to look for people (like me?) who were saying all along “there is no Fermi paradox because abiogenesis is cosmically rare”, and figure out why no one listened to them.
I think the important thing here is the difference between saying “abiogenesis is rare” (as an observation) and “we should expect that abiogenesis might be rare” (as a prediction) and “your own parameters, taken seriously, imply that we should expect that abiogenesis might be rare” (as a computation). I am not aware of papers that did the third before this, and I think most claims of the second form were heard as “the expected number of aliens is low” (which is hard to construct without fudging) as opposed to “the probability of no aliens is not tiny.”
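Here’s a rough sketch of what I mean by the third thing, in the style of the paper’s computation but with ranges I invented purely for illustration (they are not the paper’s numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
N_SAMPLES = 100_000

# "Take your stated uncertainty ranges seriously": multiply Drake-style factors
# sampled from wide (log-uniform) ranges rather than multiplying point estimates.
def log_uniform(low_exp, high_exp, size):
    return 10.0 ** rng.uniform(low_exp, high_exp, size)

stars = 1e11                                  # stars in the galaxy (point estimate)
f_planet = log_uniform(-1, 0, N_SAMPLES)      # fraction with suitable planets
f_life = log_uniform(-30, 0, N_SAMPLES)       # abiogenesis: enormously uncertain
f_intel = log_uniform(-3, 0, N_SAMPLES)       # life -> intelligence
f_visible = log_uniform(-2, 0, N_SAMPLES)     # intelligence -> detectable civilization

n_civs = stars * f_planet * f_life * f_intel * f_visible

print("mean number of civilizations:", n_civs.mean())        # stays large
print("P(fewer than 1 civilization):", (n_civs < 1).mean())  # far from negligible
```

With wide enough uncertainty on abiogenesis, the mean number of civilizations stays high while the probability of “effectively zero aliens” is substantial, which is the shape of the paper’s result.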
However, I expect self-modification to more naturally emerge out of a general reasoning AI that can understand its own composition and how the parts fit into the whole, and have “thoughts” of the form “Hmm, if I change this part of myself, it will change my behavior, which might compromise my ability to fix issues, so I better be _very careful_, and try this out on a copy of me in a sandbox”.
This does seem like a double crux; my sense is that correctly reasoning about self-modification requires a potentially complicated theory that I don’t expect a general reasoner to realize it needs as soon as it becomes capable of self-modification (or creating successor agents, which I think is a subproblem of self-modification). It seems likely that it could be in a situation like some of the Cake or Death problems, where it views a change to itself as impacting only part of its future behavior (like affecting actions but not values, such that it suspects that a future it that took path A would be disappointed in itself and fix that bug, without realizing that the change it’s making will cause future it to not be disappointed by path A), or is simply not able to foresee the impacts of its changes and so makes them ‘recklessly’ (in the sense that every particular change seems worth it, even if the policy of making changes at that threshold of certainty seems likely to lead to disaster).
If you actually had only one of those and not the other, you would notice _really fast_, so it’s not going to harm you.
The thing I’m worried about is fixing only one of them—see Reason as Memetic Immune Disorder.
Under your view, it seems like AI researchers are going to add a self-modification routine to the AI, which can unilaterally rewrite the source code of the AI as it wants. Under my view, AI researchers don’t really think much about self-modification, and just build an AI system capable of learning and performing general tasks, one of which could be the task of improving the AI system with very high confidence that the proposed improvement will work.
I think the current standard approach is unilateral modifications (what checks do we put on gradient descent modifying parameter values?), and that this is unlikely to change as AI researchers figure out how to do bolder and bolder variations. How would you classify the meta-learning approaches under development?
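To spell out what I mean by ‘unilateral’: a single vanilla SGD update just overwrites the parameters, with no check on whether anything we care about was preserved. A minimal sketch, not any particular framework’s code:

```python
# A single vanilla SGD step: the update overwrites the parameters unconditionally.
# Nothing here asks whether the new parameters preserve any property we care about.
def sgd_step(params, grads, lr=0.1):
    return [p - lr * g for p, g in zip(params, grads)]

params = [0.5, -1.2, 3.0]
grads = [0.1, -0.4, 2.0]
params = sgd_step(params, grads)  # the old values are simply gone
print(params)                     # roughly [0.49, -1.16, 2.8]
```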
I think it’s likely that there will be some safeguards in place, much in the way that you don’t get robust multicellular life without some mechanisms of correcting cancers when they develop. The root of my worry here is that I don’t expect this problem to be solved well if researchers aren’t thinking much about self-modification (and thus how to solve it well).
Do you generally trust that you personally could be handed the key to human self-modification?
I think this depends a lot on how the key is shaped. If I can write rules for moving around cells in my body, or modifying the properties of those cells, probably not, because I don’t have enough transparency for the consequences. If I have a dial with my IQ on it, probably, or if I have a set of dials related to the strength of various motivations, probably, but here I would still feel like there are significant risks associated with moving outside normal bounds that I would be accepting because we live in weird times. [For example, it seems likely that some genes that increase intelligence also increase brain cancer risk, and it seems possible that ‘turning the IQ dial’ with this key would similarly increase my chance of having brain cancer.]
Similarly, being able to print the genome for potential children rather than rolling randomly or selecting from a few options seems like it would be useful and I would use it, but is not making the situation significantly safer and could easily lead to systematic problems because of correlated choices.
Keep in mind that the overseer (two steps forward) is always far more powerful than the agent we’re distilling (one step back), is trained to not Goodhart, is training the new agent to not Goodhart (this is largely my interpretation of what corrigibility gets you), and is explicitly searching for ways in which the new agent may want to Goodhart.
Well, but Goodhart lurks in the soul of all of us; the question here is something like “what needs to be true about the overseer such that it does not Goodhart (and can recognize it in others)?”
Not quite. The mean number of aliens we expect to see is basically unchanged—the main claim the paper is making is that a very high probability of 0 aliens is consistent with uncertainty ranges that people have already expressed, and thus with the high mean number of aliens that people would have expected to see before observations.