If the inner optimizer only affects the world by passing predictions to the outer model, the most obvious trick is to assign artificially negative outcomes to states you want to avoid (e.g. states the inner optimizer predicts would update the outer model in bad ways), which then never get checked, by virtue of being too scary to try. What are the other obvious hacks?
I guess if you sufficiently control the predictions, you can just throw off the pretense and send the outer model the prediction that giving you a channel to the outside would be a great idea.
Reading this makes me realize what a great game One Night Werewolf is, because the secret information encourages you to lie even when you’re in the majority, which is much more interesting than just one group telling the truth and the other group lying.
What’s the input-output function in the two cases?
Good question :) We need the AI to have a persistent internal representation of the world so that it’s not limited to preferences directly over sensory inputs. Many possible functions would work, and in various places (like comparison to CIRL), I’ve mentioned that it would be really useful to have some properties of a hierarchical probabilistic model, but as an aid to imagination I mostly just thought of a big ol’ RNN.
We want the world model to share associations between words and observations, but we don’t want it to share dynamics (one text-state following another is a very different process from one world-state following another). It might be sufficient for the encoding/decoding functions from observations to be RNNs, and the encoding/decoding functions from text just to be non-recurrent neural networks on patches of text.
That is, if we call the text $T$, the observations (at time $t$) $O_t$, and the internal state $S_t$, we’d have the encoding function $(O_t, S_t) \to S_{t+1}$, a decoding function something like $(O_t, S_t) \to O_{t+1}$, and also $S \to T$ and $T \to S$. And then you could compose these functions to get things like $(O_t, S_t) \to T_{t+1}$. Does this answer your question, and do you think it brings new problems to light? I’m more interested in general problems or patterns than in problems specific to RNNs (like initialization of the state), because I’m sort of assuming that this is just a placeholder for future technology that would have a shot at learning a model of the entire world.
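To make those signatures concrete, here's a minimal PyTorch sketch of the shape of the thing. The module choices and sizes are my own placeholders rather than anything specified above, with a GRU standing in for the "big ol' RNN":

```python
import torch
import torch.nn as nn

STATE_DIM, OBS_DIM, TEXT_DIM = 256, 64, 32

class WorldModel(nn.Module):
    """The four maps from the comment above: (O_t, S_t) -> S_{t+1},
    (O_t, S_t) -> O_{t+1}, S -> T, and T -> S."""

    def __init__(self):
        super().__init__()
        # Recurrent pathway: observations update the persistent state.
        self.obs_encoder = nn.GRUCell(OBS_DIM, STATE_DIM)           # (O_t, S_t) -> S_{t+1}
        self.obs_decoder = nn.Linear(OBS_DIM + STATE_DIM, OBS_DIM)  # (O_t, S_t) -> O_{t+1}
        # Non-recurrent pathway: text patches map to and from the state.
        self.text_encoder = nn.Linear(TEXT_DIM, STATE_DIM)          # T -> S
        self.text_decoder = nn.Linear(STATE_DIM, TEXT_DIM)          # S -> T

    def step(self, obs, state):
        """One tick of world dynamics: new state plus predicted next observation."""
        next_state = self.obs_encoder(obs, state)
        next_obs = self.obs_decoder(torch.cat([obs, state], dim=-1))
        return next_state, next_obs

    def describe(self, obs, state):
        """The composed map (O_t, S_t) -> T_{t+1}: step, then decode state to text."""
        next_state, _ = self.step(obs, state)
        return self.text_decoder(next_state)

model = WorldModel()
state = torch.zeros(1, STATE_DIM)
obs = torch.randn(1, OBS_DIM)
state, predicted_obs = model.step(obs, state)
text_prediction = model.describe(obs, state)
state_from_text = model.text_encoder(torch.randn(1, TEXT_DIM))  # T -> S
```

The split shows up in the types: only the observation pathway carries recurrence, while the text maps are stateless networks on patches of text, so text dynamics and world dynamics never share machinery.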
For example, I would say that a brain has one world model that is interlinked with speech and vision and action, etc. Right?
Right. I sort of flip-flop on this, also calling it “one simultaneous model” plenty. If there are multiple “models” in here, it’s because different tasks use different subsets of its parts, and if we do training on multiple tasks, those subsets get trained together. But of course the point is that the subsets overlap.
Let me mention my favorite intuition pump against the axiom of choice: the prisoners with infinite hats. For any finite number of prisoners, if they can’t communicate they can’t even do better than chance, let alone save all but a tiny fraction. But as soon as there are infinitely many, there’s some strange ritual they can do that lets them save all but finitely many of them. This is unreasonable.
The issue is that once you have infinite prisoners you can construct these janky non-measurable sets that aren’t subject to the laws of probability theory. There’s an argument to be made that these are a bigger problem than the axiom of choice—the axiom of choice is just what lets you take the existence of these janky, non-constructive sets and declare that they give you a recipe for saving prisoners.
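For readers who haven't seen it, here is the ritual sketched out (the standard equivalence-class construction; this is my paraphrase, not anything from the thread):

```latex
% Hat sequences are elements of \{0,1\}^{\mathbb{N}}. Call two sequences
% equivalent if they differ in only finitely many places:
\[
  x \sim y \iff \bigl|\{\, n \in \mathbb{N} : x_n \neq y_n \,\}\bigr| < \infty .
\]
% By the axiom of choice, fix a representative r([x]) of every equivalence
% class. Prisoner n sees every hat except their own, which already
% determines the class [x] of the true sequence x, so they guess
\[
  \text{guess}_n = r([x])_n .
\]
% Since x \sim r([x]), only finitely many prisoners guess wrong -- and the
% set of representatives is exactly the kind of non-measurable object
% described above.
```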
The class of non-agent AIs (ones not choosing actions based on the predicted resulting utility) seems very broad. We could choose actions alphabetically, or use an expert system representing the outside view, or use a biased/inaccurate model when predicting consequences, or include preferences about which actions are good or bad in themselves.
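As a toy illustration of how different these selection rules are (everything here is hypothetical, just to contrast the type signatures):

```python
import random

ACTIONS = ["brew_tea", "update_spreadsheet", "water_plants"]

def predicted_utility(action: str) -> float:
    """Stand-in for a consequence-predicting world model (hypothetical)."""
    return random.random()

def agent_policy(actions):
    # Agent-style selection: maximize predicted resulting utility.
    return max(actions, key=predicted_utility)

def alphabetical_policy(actions):
    # Non-agent: the choice rule never consults predicted consequences.
    return sorted(actions)[0]

def deontological_policy(actions, intrinsic_goodness):
    # Non-agent: preferences over actions in themselves, not over outcomes.
    return max(actions, key=intrinsic_goodness.get)

print(agent_policy(ACTIONS))
print(alphabetical_policy(ACTIONS))
print(deontological_policy(ACTIONS,
      {"brew_tea": 1.0, "update_spreadsheet": 0.2, "water_plants": 0.5}))
```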
I don’t think there’s any general failure mode (there are certainly specific ones), but if we condition on this AI being selected by humans, maybe we select something that’s doing enough optimization that it will take a highly-optimizing action like rewriting itself to be an agent.
Somehow I missed that second post of yours. I’ll try out the subscribe function :)
Do you also get the feeling that you can sort of see where this is going in advance?
When asking what computations a system instantiates, it seems you’re asking which models (or which fits to an instantiated function) perform surprisingly well, given the amount of information used.
To talk about humans wanting things, you need to locate their “wants.” In the simple case this means knowing in advance which model, or which class of models, you are using. I think there are interesting predictions we can make about taking a known class of models and asking “does one of these do a surprisingly good job at predicting a system in this part of the world including humans?”
The answer is going to be yes, several times over—humans, and human-containing parts of the environment, are pretty predictable systems, at multiple different levels of abstraction. This is true even if you assume there’s some “right” model of humans and you get to start with it, because this model would also be surprisingly effective at predicting e.g. the human+phone system, or humans at slightly lower or higher levels of abstraction. So now you have a problem of underdetermination. What to do? The simple answer is to pick whatever had the highest surprising power, but I think that’s not only simple but also wrong.
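To make "surprising power" slightly more concrete, here is a toy scoring rule (entirely my own operationalization, not something proposed above): credit a model for how much better it predicts than a baseline, and charge it for the bits needed to specify it.

```python
def surprise_score(model_log_loss_bits: float,
                   baseline_log_loss_bits: float,
                   description_length_bits: float) -> float:
    """Bits of predictive advantage over a baseline, minus the bits it
    costs to specify the model. Positive means 'surprisingly good given
    the amount of information used.'"""
    return (baseline_log_loss_bits - model_log_loss_bits) - description_length_bits

# Underdetermination in miniature: several overlapping models can all be
# surprisingly good at once. (All numbers are made up for illustration.)
candidates = {
    "human (folk-psychological level)": surprise_score(100.0, 500.0, 50.0),
    "human+phone system":               surprise_score(90.0, 500.0, 120.0),
    "cruder model, fewer bits":         surprise_score(200.0, 500.0, 10.0),
}
for name, score in sorted(candidates.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:+.0f} bits of surprise")
```

Note that picking the top scorer here is exactly the "simple but wrong" answer warned about above, since all three models are legitimately surprising.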
Anyhow, since you mention you’re not into hand-coding models of humans where we know where the “wants” are stored, I’d be interested in your thoughts on that step too, since just looking for all computations that humans instantiate is going to return a whole lot of answers.
Obviously if you tell me the sum, I just want to know die1 − die2? The only problem is that the signed difference looks like a uniform distribution whose width depends on the sum: it ranges from 6 possibilities (a sum of 7 allows ±1, ±3, ±5) down to 1 (a sum of 2 or 12 forces a difference of 0).
So what I think you do is put all the differences onto the same scale by constructing a “unitless difference,” which I’ll actually define as a uniform distribution.
Rather than having the difference be a single number in a chunk of the number line that changes in size, you construct a big set of ordered points of fixed size, equal to the least common multiple of the numbers of possible differences across all sums. If you think of a difference not as a number, but as a uniform distribution on the set of possible differences, then you can just “scale up” this distribution from its variable-size set into the big set of constant size, and sample from this distribution to forget the sum but remember the most information about the difference.
EDIT: I shouldn’t do math while tired.
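Tired math notwithstanding, the construction does work. Here is a sketch (my own implementation of the scheme as described, with parity-corrected counts; the LCM works out to 60):

```python
import math
import random
from functools import reduce

# Possible signed differences d = die1 - die2 for each sum s of two dice:
# d must share the parity of s, and both faces must land in 1..6.
def differences_for_sum(s):
    return [a - (s - a) for a in range(1, 7) if 1 <= s - a <= 6]

counts = {s: len(differences_for_sum(s)) for s in range(2, 13)}
L = reduce(math.lcm, counts.values())  # lcm of 1..6 = 60

def unitless_difference(die1, die2):
    """Map the difference to a uniform sample on range(L): each possible
    difference for this sum owns an equal-sized block of the big set."""
    s = die1 + die2
    ds = sorted(differences_for_sum(s))
    block = L // len(ds)              # block size for this particular sum
    i = ds.index(die1 - die2)         # which difference we actually got
    return i * block + random.randrange(block)

# Because the difference is uniform given any particular sum, the output is
# uniform on range(L) no matter what the sum was -- it carries no information
# about the sum, while preserving the most information about the difference.
print(L, unitless_difference(random.randint(1, 6), random.randint(1, 6)))
```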
I know this is becoming my schtick, but have you considered the intentional stance? Specifically, the idea that there are no such things as “the” wants and “the” ontology of E. coli, but that we ascribe wants and world-modeling to it as a convenient way of thinking about a complicated world, and that different specific models might have advantages and disadvantages with no clear winner.
I ask because this seems to make direct predictions about where the meta-strategy can go, and what it’s based on.
But all this said, I don’t think it’s hopeless. It will, however, require abstraction. There is a tradeoff between the predictive accuracy of a model of a physical system and its containing anything worth calling a “value,” and so you must allow agential models of complicated systems to predict only a small amount of information about the system, and maybe even to be poor predictors of that.
Consider how your modeling me as an agent gives you some notion of my abstract wants, but gives you only the slimmest help in predicting this text that I’m writing. Evaluated purely as a predictive model, it’s remarkably bad! It’s also based at least as much in nebulous “common sense” as it is in actually observing my behavior.
So if you’re aiming for eventually tinkering with hand-coded agential models of humans, one necessary ingredient is going to be tolerance for abstraction and suboptimal predictive power. And another ingredient is going to be this “common sense,” though maybe you can substitute for that with hand-coding—it might not be impossible, given how simplified our intuitive agential models of humans are.
This is a really cool post, thanks!
Well, if players still have identifying pseudonyms, you could share that you’re being attacked by a certain pseudonym, to help others update on whether attacks are real or false, and if you know your own pseudonym, you could try to coordinate to not attack each other. But even without pseudonyms, you could share timing information, which might be important.
The earliest correct answer I know of to the question of “how do we have free will?” comes from St. Augustine, except instead of free will vs. determinism it was free will vs. divine omniscience. God knowing the future, Augustine says, doesn’t invalidate our free will, because the cause of the choice still lies within our power, and that’s what matters.
So yeah, sorry, I guess you weren’t interested in talking about whether this makes any sense in relation to free will, but it does seem worth pointing out when a framing is about 1500 years out of date.
Though for human amplification of quantum noise, check out the work on perception of single photons.
Welcome to LessWrong :) If you have not read the Sequences, particularly A Human’s Guide to Words, I think you might find them really interesting, and they have some bearing on this question.
It looks like you get a big advantage from sharing information with another player. In the absence of this, maybe the best strategy is to respond to all alerts with some probability that you think leads to good dynamics if adopted as a general strategy.
Sure. The way it helps is with personal moral indeterminacy: when I want to make a decision, but am aware that, strictly speaking, my values are undefined, I should still do what seems right. For a more direct approach to the problem, see Eliezer’s point about type 1 and type 2 calculators.
You know, this isn’t why I usually get called a tool :P
I think I’m saying something pretty different from Nietzsche here. The problem with “Just decide for yourself” as an approach to dealing with moral decisions in novel contexts (like what to do with the whole galaxy) is that, though it may help you choose actions rather than worrying about what’s right, it’s not much help in building an AI.
We certainly can’t tell the AI “Just decide for yourself,” that’s trying to order around the nonexistent ghost in the machine. And while I could say “Do exactly what Charlie would do,” even I wouldn’t want the AI to do that, let alone other people. Nor can we fall back on “Well, designing an AI is an action, therefore I should just pick whatever AI design I feel like, because God is dead and I should just pick actions how I will,” because how I feel like designing an AI has some very exacting requirements—it contains the whole problem in itself.
I forget what got me thinking about this recently, but seeing this branching tree reminded me of something important: when something with many contributions goes well, it’s probably because all the contributions were a bit above average, not because one of them blew the end off the scale. For example, if I learn that the sum of two normal distributions is above average, I expect the excess to be divided evenly between the two components, in units of standard deviations.
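A quick check of the Gaussian case (my derivation, not spelled out in the comment): conditioning on the sum splits the excess in proportion to each component's variance, so equal-variance components share it evenly and neither is expected to carry the whole surprise.

```latex
% For independent X ~ N(\mu_X, \sigma_X^2) and Y ~ N(\mu_Y, \sigma_Y^2),
% with S = X + Y, standard Gaussian conditioning gives
\[
  \mathbb{E}[X \mid S = s]
    = \mu_X + \frac{\sigma_X^2}{\sigma_X^2 + \sigma_Y^2}
      \bigl(s - \mu_X - \mu_Y\bigr),
\]
% and symmetrically for Y. Each component absorbs a share of the excess
% proportional to its variance; when \sigma_X = \sigma_Y the split is
% exactly even.
```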
Which is not to say Paul implies anything to the contrary. I was just reminded.
Thought 2 is that in this way of presenting it, I really didn’t see a difference between inner and outer alignment. If you try to teach an AI something, and the concept it learns isn’t the concept you wanted to teach, this is not necessarily inner or outer failure. I’d thought “inner alignment” was normally used in the context of an “inner optimizer” that we might expect to get by applying lots of optimization pressure to a black box.
Hm. You could reframe the “inner alignment” stuff without reference to an inner optimizer by talking about all methods that would work on a heavily optimized black box, but then I think the category becomes broad and includes Paul’s work. But maybe this is for the best? “Transparent box” alignment could still include important problems in designing alignment schemes where the agent is treated as having separable world-models and planning faculties, though given the leakiness of the “agent” abstraction any solution will require “black box” alignment as well.
I imagine it’s like how it’s not at all obvious from the outside what a “producer” does in a play or movie—but after you are part of a production, it becomes clear why one needs a producer.
(Or, if there’s enough money involved that the producer is mainly a financial job, substitute the appropriate name for the person who makes sure things are working.)