Can’t tell if joking, but they probably mean that they were “actually in the mafia” in the game, so not in the real-world mafia.
A better system won’t just magically form itself after the existing system has been destroyed. In all likelihood what will form will be either a far more corrupt and oligarchical system, or no system at all. I think a better target for intervention would be attempting to build superior alternatives so that something is available when the existing systems start to fail. In education for example, Lambda School is providing a better way for many people to learn programming than college.
Note also that existing systems of power are very big, so efforts to damage them probably have low marginal impact. Building initially small new things can have much higher marginal impact. If the systems are as corrupt as you think they are, they should destroy themselves on their own in any case.
Update—We’re going to have a meetup on September 21st at Uncommon Grounds (1030 South Park Street). This is going to be posted on SlateStarCodex as part of the worldwide meetup event.
No worries, I also missed the earlier posts when I wrote mine. There’s lots of stuff on this website.
I endorse your rephrasing of example 1. I think my position is that it’s just not that hard to create a “self-consistent probability distribution”. For example, say you trained an RNN to predict sequences, like in this post. Despite being very simple, it already implicitly represents a probability distribution over sequences. If you train it with back-propagation on a confusing article involving pyrite, then its weights will be updated to try to model the article better. However, if “pyrite” itself was easy to predict, then the weights that lead to it outputting “pyrite” will *not* be updated. The same thing holds for modern Transformer networks, which predict the next token based only on what they have seen so far. (Here is a paper with a recent example using GPT-2. Note the degeneracy of maximum-likelihood sampling, and how this becomes less of a problem when just sampling from the implied distribution.)
I agree that this sort of manipulative prediction could be a problem in principle, but it does not seem to occur in recent ML systems. (Although, there are some things which are somewhat like this; the earlier paper I linked and mode collapse do involve neglecting high-entropy components of the distribution. However, the most straightforward generation and training schemes do not incentivize this)
For example 2, the point about gradient descent is this: while it might be the case that outputting “Help I’m stuck in a GPU Factory000” would ultimately result in a higher accuracy, the way the gradient is propagated would not encourage the agent to behave manipulatively. This is because, *locally*, “Help I’m stuck in a GPU Factory” decreases accuracy, so that behavior (or policies leading to it) will be dis-incentivized by gradient descent. It may be the case that this will result in easier predictions later, but the structure of the reward function does not lead to any optimization pressure towards such manipulative strategies. Learning taking place over high-level abstractions doesn’t change anything, because any high-level abstractions leading to locally bad behavior will likewise be dis-incentivized by gradient descent.
Example 1 basically seems to be the problem of output diversity in generative models. This can be a problem in generative models, but there are ways around it. For example, instead of outputting the highest-probability individual sequence, which will certainly look “manipulative” as you say, sample from the implied distribution over sequences. Then the sentence involving “pyrite” will be output with probability proportional to how likely the model thinks “pyrite” is on its own, disregarding subsequent tokens.
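Here is a minimal sketch of that difference, assuming a made-up two-step model (the tokens, continuations, and probabilities below are all hypothetical, chosen only for illustration). “pyrite” is the more likely first token, but its continuations are spread out, so the single most likely full sequence avoids it. Sampling step by step does not have this problem.

```python
import random

# Hypothetical toy model: P(first token) and P(continuation | first token).
FIRST = {"pyrite": 0.6, "gold": 0.4}
REST = {
    # High-entropy continuations: each full "pyrite" sequence has probability 0.15.
    "pyrite": {
        "is confusing": 0.25,
        "is shiny": 0.25,
        "is a mineral": 0.25,
        "is iron sulfide": 0.25,
    },
    # Low-entropy continuation: the single "gold" sequence has probability 0.4.
    "gold": {"is valuable": 1.0},
}


def sequence_probs():
    """Joint probability of every full (first, continuation) sequence."""
    return {
        (first, rest): p_first * p_rest
        for first, p_first in FIRST.items()
        for rest, p_rest in REST[first].items()
    }


def most_likely_sequence():
    """Pick the single highest-probability full sequence."""
    probs = sequence_probs()
    return max(probs, key=probs.get)


def sample_sequence(rng):
    """Sample token by token from the distribution the model implies."""
    first = rng.choices(list(FIRST), weights=list(FIRST.values()))[0]
    rest_dist = REST[first]
    rest = rng.choices(list(rest_dist), weights=list(rest_dist.values()))[0]
    return first, rest


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [sample_sequence(rng) for _ in range(10_000)]
    pyrite_fraction = sum(first == "pyrite" for first, _ in samples) / len(samples)
    print("most likely sequence:", most_likely_sequence())
    print("fraction of samples starting with 'pyrite':", pyrite_fraction)
```

The highest-probability sequence starts with “gold”, while the samples start with “pyrite” at roughly its marginal probability of 0.6.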
For example 2, I wrote a similar post a few months ago (and in fact, this idea seems to have been proposed and forgotten a few times on LW). But for gradient descent-based learning systems, I don’t think the effect described will take place.
The reason is that gradient-descent-based systems are only updated towards what they actually observe. Let’s say we’re training a system to predict EU laws. If it predicts “The EU will pass potato laws...” but sees “The EU will pass corn laws...” the parameters will be updated to make “corn” more likely to have been output than “potato”. There is no explicit global optimization for prediction accuracy.
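To make the update concrete, here is a rough sketch of one gradient step on a softmax output under log loss. The three-word vocabulary, the starting logits, and the learning rate are assumptions for illustration; a real model would backpropagate this same signal into its weights.

```python
import math


def softmax(logits):
    """Convert a dict of logits into a dict of probabilities."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}


def sgd_step(logits, observed, lr=1.0):
    """One gradient-descent step on -log p(observed).

    The gradient of the loss with respect to each logit is
    p(token) - 1[token == observed], so only the observed token is pushed up.
    """
    probs = softmax(logits)
    return {
        tok: v - lr * (probs[tok] - (1.0 if tok == observed else 0.0))
        for tok, v in logits.items()
    }


# Hypothetical starting point: the model favours "potato".
logits = {"potato": 2.0, "corn": 0.0, "wheat": 0.0}
before = softmax(logits)
after = softmax(sgd_step(logits, observed="corn"))

if __name__ == "__main__":
    print("before:", before)
    print("after: ", after)
```

The step only uses the token that was actually observed. Nothing in it looks ahead to whether outputting “corn” makes later predictions easier or harder.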
As you train to convergence, the predictions of the model will attempt to approach a fixed point, a set of predictions that imply themselves. However, due to the local nature of the update, this fixed point will not be selected to be globally minimal; it will just be the first minimum the model falls into. (This is different from the problems with “local minima” you may have heard about in ordinary neural network training: those go away in the infinite-capacity limit, whereas local minima among fixed points do not.) The fixed point should look something like “what I would predict if I output [what I would predict if I output [what I would predict ... ]]]” where the initial prediction is some random gibberish. This might look pretty weird, but it’s not optimizing for global prediction accuracy.
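A toy sketch of this, under strong assumptions: the prediction is a single probability, and the hypothetical function `world_response` stands in for how the world reacts to a published prediction. The predictor is nudged toward what it observes, without differentiating through the world's response, so which fixed point it lands in depends only on where it started.

```python
import math


def world_response(p):
    """Hypothetical world: probability the event happens if p is predicted.

    Chosen (arbitrarily) to have two stable self-confirming predictions,
    one below 0.5 and one above, plus an unstable one at exactly 0.5.
    """
    return 0.5 + 0.4 * math.tanh(4.0 * (p - 0.5))


def train_to_fixed_point(p0, lr=0.5, steps=200):
    """Repeatedly move the prediction toward the observed outcome rate."""
    p = p0
    for _ in range(steps):
        p += lr * (world_response(p) - p)  # purely local update
    return p


low = train_to_fixed_point(0.3)   # starts below 0.5
high = train_to_fixed_point(0.7)  # starts above 0.5

if __name__ == "__main__":
    print("fixed point from 0.3:", low)
    print("fixed point from 0.7:", high)
```

Both results are self-confirming (`world_response(p)` equals `p`), but neither run compares the two or prefers one over the other.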
Hey! It did happen. So far there are 3 of us, we’ve been meeting up pretty regularly. If you’re interested I can let you know the next time we’re planning to meet up.
What this would mean is that we would have to recalibrate our notion of “safe”, as whatever definition has been proved impossible does not match our intuitive perception. We consider lots of stuff we have around now to be reasonably safe, although we don’t have a formal proof of safety for almost anything.
In the mad scientist example, why would your measure for the die landing 0 be 0.91? I think Solomonoff Induction would assign probability 0.1 to that outcome, because you need an extra log2(90) bits to specify which clone you are. Or is this just meant to illustrate a problem with ASSA, UD not included?
Yeah, if you train the algorithm by random sampling, the effect I described will take place. The same thing will happen if you use an RL algorithm to update the parameters instead of an unsupervised learning algorithm (though it seems willfully perverse to do so: you’re throwing away a lot of the structure of the problem by doing this, so training will be much slower).
I also just found an old comment which makes the exact same argument I made here. (Though it now seems to me that argument is not necessarily correct!)
If you literally ran (a powered-up version of) GPT-2 on “A brilliant solution to the AI alignment problem is...” you would get the sort of thing an average internet user would think of as a brilliant solution to the AI alignment problem. Trying to do this more usefully basically leads to Paul’s agenda (which is about trying to do imitation learning of an implicit organization of humans)
Reflective Oracles are a bit of a weird case because their ‘loss’ is more like a 0⁄1 loss than a log loss, so all of the minima are exactly the same. (If we take a sample of 100000 universes to score them, the difference is merely incredibly small instead of 0.) I was being a bit glib referencing them in the article; I had in mind something more like a model parameterizing a distribution over outputs, whose only influence on the world is via a random sample from this distribution. I think that such models should in general have fixed points for similar reasons, but am not sure. Regardless, these models will, I believe, favour fixed points whose distributions are easy to compute. (But not fixed points with low entropy; that is, they will punish logical uncertainty but not intrinsic uncertainty.) I’m planning to run some experiments with VAEs and post the results later.
You might be interested in Transformer Networks, which use a learned pattern of attention to route data between layers. They’re pretty popular and have been used in some impressive applications like this very convincing image-synthesis GAN.
re: whether this is a good research direction. The fact that neural networks are highly compressible is very interesting and I too suspect that exploiting this fact could lead to more powerful models. However, if your goal is to increase the chance that AI has a positive impact, then it seems like the relevant thing is how quickly our understanding of how to align AI systems progresses, relative to our understanding of how to build powerful AI systems. As described, this idea sounds like it would be more useful for the latter.
Is there a reason you think a reflective oracle (or equivalent) can’t just be selected “arbitrarily”, and will likely be selected to maximize some score?
The gradient descent is not being done over the reflective oracles, it’s being done over some general computational model like a neural net. Any highly-performing solution will necessarily look like a fixed-point-finding computation of some kind, due to the self-referential nature of the predictions. Then, since this fixed-point-finder is *internal* to the model, it will be optimized for log loss just like everything else in the model.
That is, the global optimization of the model is distinct from whatever internal optimization the fixed-point-finder uses to choose the reflective oracle. The global optimization will favor internal optimizers that produce fixed-points with good score. So while fixed-point-finders in general won’t optimize for anything in particular, the one this model uses will.
I submit Predictors as Agents.
If we assume Sleeping Beauty has lots of information, we might expect that the shortest matching program will look like a simulation of physical law plus a “bridging law” that, given this simulation, tells you what symbols get written to the tape
I agree. I still think that the probabilities would be closer to 1⁄2, 1⁄4, 1⁄4. The bridging law could look like this: search over the universe for compact encodings of my memories so far, then see what is written next onto this encoding. In this case, it would take no more bits to specify waking up on Tuesday, because the memories are identical, in the same format, and just slightly later temporally.
In a naturalized setting, it seems like the tricky part would be getting the AIXI on Monday to care what happens after it goes to sleep. It ‘knows’ that it’s going to lose consciousness (it can see that its current memory encoding is going to be overwritten), so its next prediction is undetermined by its world-model. There is one program that will give it the reward of its successor and then terminate, as I described above, but it’s not clear why the AIXI would favour that hypothesis. Maybe it would if it has been in situations involving memory-wiping before, or has observed other RO-AIXIs in such situations.
“I can’t make bets on my beliefs about the Eschaton, because they are about the Eschaton.” -- Well, it makes sense. Besides, I did offer you a bet taking into account a) that the money may be worth less in my branch b) I don’t think DL + RL AGI is more likely than not, just plausible. If you’re more than 96% certain there will be no such AI, 20:1 odds are a good deal.
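For reference, the break-even arithmetic can be sketched as below. I’m assuming the side betting against such an AI risks 20 units to win 1; the function names are just for illustration.

```python
from fractions import Fraction


def break_even_confidence(stake, payout):
    """Win probability at which risking `stake` to win `payout` has zero expected value."""
    return Fraction(stake, stake + payout)


def expected_value(p_win, stake, payout):
    """Expected gain when you win `payout` with probability p_win, else lose `stake`."""
    return p_win * payout - (1 - p_win) * stake


if __name__ == "__main__":
    threshold = break_even_confidence(20, 1)
    print("break-even confidence:", threshold, "~", float(threshold))
    print("EV at 96% confidence:", expected_value(Fraction(96, 100), 20, 1))
```

Under that assumption the break-even point is 20/21, a bit over 95%, so anyone above 96% comes out ahead in expectation.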
But anyways, I would be fine with betting on a nearer-term challenge. How about this: in 5 years, a bipedal robot that can run on rough terrain, as in this video, using a policy learned from scratch by DL + RL (possibly including a simulated environment during training), at 1:1 odds.
Hmmm...but if I win the bet then the world may be destroyed, or our environment could change so much the money will become worthless. Would you take 20:1 odds that there won’t be DL+RL-based HLAI in 25 years?
I still don’t see how you’re getting those probabilities. Say it takes 1 bit to describe the outcome of the coin toss, and assume it’s easy to find all the copies of yourself (i.e., your memories) in different worlds. Then you need:
1 bit to specify if the coin landed heads or tails
If the coin landed tails, you need 1 more bit to specify if it’s Monday or Tuesday.
So AIXI would give these scenarios P(HM)=0.50, P(TM)=0.25, P(TT)=0.25.
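The bit counting above can be written out as a tiny sketch, assuming each scenario's weight is 2 to the power of minus the number of bits needed to pick it out (the dictionary and function names are mine):

```python
# Bits needed to specify each scenario, as counted above:
# 1 bit for the coin, plus 1 more bit for the day if the coin landed tails.
BITS = {"HM": 1, "TM": 2, "TT": 2}


def prior_from_bits(bits):
    """Weight each scenario by 2 ** -(description length in bits)."""
    return {scenario: 2.0 ** -n for scenario, n in bits.items()}


if __name__ == "__main__":
    print(prior_from_bits(BITS))
```

The weights already sum to 1 here, so no renormalization is needed.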