“embedded self-justification,” or something like that

Sometimes I wonder what the MIRI-type crowd thinks about some issue related to their interests. So I go to alignmentforum.org, and quickly get in over my head, lost in a labyrinth of issues I only half understand.

I can never tell whether they’ve never thought about the things I’m thinking about, or whether they sped past them years ago. They do seem very smart, that’s for sure.

But if they have terms for what I’m thinking of, I lack the ability to find those terms among the twists of their mirrored hallways. So I go to tumblr.com, and just start typing.

parable (1/3)

You’re an “agent” trying to take good actions over time in a physical environment under resource constraints. You know, the usual.

You currently spend a lot of resources doing a particular computation involved in your decision procedure. Your best known algorithm for it is O(N^n) for some n.

You’ve worked on the design of decision algorithms before, and you think this could perhaps be improved. But to find a better algorithm, you’d have to shift some resources away from running the current one for a time, putting them into decision algorithm design instead.

You do this. Almost immediately, you discover an O(N^(n-1)) algorithm. Given the large N you face, this will dramatically improve all your future decisions.

Clearly (…“clearly”?), the choice to invest more in algorithm design was a good one.

Could you have anticipated this beforehand? Could you have acted on that knowledge?
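The trade-off in the parable can be made concrete with a toy cost model. All the numbers below are invented for illustration (the problem size, the R&D cost, and the assumption that the R&D succeeds at all); the point is only that, under such assumptions, the break-even horizon is computable in advance:

```python
# Toy cost model for the parable's R&D trade-off.
# N, n, and rd_cost are illustrative assumptions, not claims
# about any real agent's situation.

N = 1_000            # problem size faced at each decision
n = 3                # current algorithm runs in ~N**n operations
rd_cost = 50 * N**n  # ops diverted into algorithm design (assumed)

old_per_step = N**n        # cost per decision before the discovery
new_per_step = N**(n - 1)  # cost per decision after finding O(N^(n-1))

savings_per_step = old_per_step - new_per_step
break_even_steps = rd_cost / savings_per_step
print(f"R&D pays for itself after ~{break_even_steps:.1f} decisions")
```

Of course, the model only answers the question by assuming away the hard part: it presumes you already know what the R&D will cost and what it will find.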

parable (2/3)

Oh, you’re so very clever! By now you’ve realized you need, above and beyond your regular decision procedure, which guides your actions in the outside world, a “meta-decision-procedure” to guide your own decision-procedure-improvement efforts.

Your meta-decision-procedure does require its own resource overhead, but in exchange it tells you when and where to spend resources on R&D. All your algorithms are faster now. Your decisions are better, their guiding approximations less lossy.

All this, from a meta-decision-procedure that’s only a first draft. You frown over the resource overhead it charges, and wonder whether it could be improved.

You try shifting some resources away from “regular decision procedure design” into “meta-decision-procedure-design.” Almost immediately, you come up with a faster and better procedure.

Could you have anticipated this beforehand? Could you have acted on that knowledge?

parable (3/3)

Oh, you’re so very clever! By now you’ve realized you need, above and beyond your meta-meta-meta-decision-procedure, a “meta-meta-meta-meta-decision-procedure” to guide your meta-meta-meta-decision-procedure-improvement efforts.

Way down on the object level, you have not moved for a very long time, except to occasionally update your meta-meta-meta-meta-rationality blog.

Way down on the object level, a dumb and fast predator eats you.

Could you have anticipated this beforehand? Could you have acted on that knowledge?

the boundary

You’re an “agent” trying to take good actions, et cetera. Your actions are guided by some sort of overall “model” of how things are.

There are, inevitably, two parts to your model: the interior and the boundary.

The interior is everything you treat as fair game for iterative and reflective improvement. For “optimization,” if you want to put it that way. Facts in the interior are subject to rational scrutiny; procedures in the interior have been judged and selected for their quality, using some further procedure.

The boundary is the outermost shell, where resource constraints force the regress to stop. Perhaps you have a target and an optimization procedure. If you haven’t tested the optimization procedure against alternatives, it’s in your boundary. If you have, but you haven’t tested your optimization-procedure-testing-procedure against alternatives, then it’s in your boundary. Et cetera.

You are a business. You do retrospectives on your projects. You’re so very clever, in fact, that you do retrospectives on your retrospective process, to improve it over time. But how do you improve these retro-retros? You don’t. They’re in your boundary.

Of everything you know and do, you trust the boundary the least. You have applied less scrutiny to it than anything else. You suspect it may be shamefully suboptimal, just like the previous boundary, before you pushed it into the interior.

embedded self-justification

You would like to look back on the resources you spend – each second, each joule – and say, “I spent it the right way.” You would like to say, “I have a theory of what it means to decide well, and I applied it, and so I decided well.”

Why did you spend it as you did, then? You cannot answer, ever, without your answer invoking something on the boundary.

How did you spend that second? On looking for a faster algorithm. Why? Because your R&D allocation procedure told you to. Why follow that procedure? Because it’s done better than others in the past. How do you know? Because you’ve compared it to others. Which others? Under what assumptions? Oh, your procedure-experimentation procedure told you. And how do you know it works? Eventually you come to the boundary, and throw up your hands: “I’m doing the best I can, okay!”

If you lived in a simple and transparent world, maybe you could just find the optimal policy once and for all. If you really were literally the bandit among the slot machines – and you knew this, perfectly, with credence 1 – maybe you could solve for the optimal explore/exploit behavior and then do it.
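In that toy case, “solving it once and for all” is a real, finite computation. A minimal sketch, under illustrative assumptions (two Bernoulli arms, uniform priors, a known horizon of ten pulls): the exactly optimal explore/exploit policy falls out of dynamic programming over posterior counts.

```python
from functools import lru_cache

HORIZON = 10  # illustrative; a real agent doesn't know its horizon

@lru_cache(maxsize=None)
def value(s1, f1, s2, f2, t):
    """Expected total reward of the optimal policy with t pulls left,
    given success/failure counts per arm (uniform Beta(1,1) priors)."""
    if t == 0:
        return 0.0
    p1 = (s1 + 1) / (s1 + f1 + 2)  # posterior mean of arm 1
    p2 = (s2 + 1) / (s2 + f2 + 2)  # posterior mean of arm 2
    pull1 = p1 * (1 + value(s1 + 1, f1, s2, f2, t - 1)) \
          + (1 - p1) * value(s1, f1 + 1, s2, f2, t - 1)
    pull2 = p2 * (1 + value(s1, f1, s2 + 1, f2, t - 1)) \
          + (1 - p2) * value(s1, f1, s2, f2 + 1, t - 1)
    return max(pull1, pull2)

print(round(value(0, 0, 0, 0, HORIZON), 3))
```

The policy exists only because the model is tiny and fully known: the state space of posterior counts fits in memory, and credence 1 in the model is baked into the recursion. That is exactly what your actual situation refuses to grant you.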

But your world isn’t like that. You know this, and know that you know it. Even if you could obtain a perfect model of your world and beings like you, you wouldn’t be able to fit it inside your own head, much less run it fast enough to be useful. (If you had a magic amulet, you might be able to fit yourself inside your own head, but you live in reality.)

Instead, you have detailed pictures of specific fragments of the world, in the interior and subject to continuous refinement. And then you have pictures of the picture-making process, and so on. As you go further out, the pictures get coarser and simpler, because their domain of description becomes ever vaster, while your resources remain finite, and you must nourish each level with a portion of those resources before the level above it even becomes thinkable.

At the end, at the boundary, you have the coarsest picture, a sort of cartoon. There is a smiling stick figure, perhaps wearing a lab coat to indicate scientific-rational values. It reaches for the lever of a slot machine, labeled “action,” while peering into a sketch of an oscilloscope, labeled “observations.” A single arrow curls around, pointing from the diagram back into the diagram. It is labeled “optimization,” and decorated with cute little sparkles and hearts, to convey its wonderfulness. The margins of the page are littered with equations, describing the littlest of toy models: bandit problems, Dutch book scenarios, Nash equilibria under perfect information.

In the interior, there are much richer, more beautiful pictures that are otherwise a lot like this one. In the interior, meta-learning algorithms buzz away on a GPU, using the latest and greatest procedures for finding procedures, justified in precise terms in your latest paper. You gesture at a whiteboard as you prioritize options for improving the algorithms. Your prioritization framework has gone through rigorous testing.

Why, in the end, do you do all of it? Because you are the little stick figure in the lab coat.


What am I trying to get at, here?

Occasionally people talk about the relevance of computational complexity issues to AI and its limits. Gwern has a good page on why these concerns can’t place useful bounds on the potential of machine intelligence in the way people sometimes argue they do.

Yet, somehow I feel an unscratched itch when I read arguments like Gwern’s there. They answer the question I think I’m asking when I seek them out, but at the end I feel like I really meant to ask some other question instead.

Given computational constraints, how “superhuman” could an AI be? Well, it could just do what we do, but sped up – that is, it could have the same resource efficiency but more resources per unit time. That’s enough to be scary. It could also find more efficient algorithms and procedures, just as we do in our own research – but it would find them ever faster, more efficiently.

What remains unanswered, though, is whether there is any useful way of talking about doing this (the whole thing, including the self-improvement R&D) well, doing it rationally, as opposed to doing it in a way that simply “seems to work” after the fact.

How would an AI’s own policy for investment in self-improvement compare to our own (to yours, to your society’s)? Could we look at it and say, “this is better”? Could the AI do so? Is there anything better than simply bumbling around in concept-space, in a manner that perhaps has many internal structures of self-justification but is not known to work as a whole? Is there such a thing as (approximate) knowledge about the right way to do all of it that is still small enough to fit inside the agent on which it passes judgment?

Can you represent your overall policy, your outermost strategy-over-strategies considered as a response to your entire situation, in a way that is not a cartoon, a way real enough to defend itself?

What is really known about the best way to spend the next unit of resources? I mean, known at the level of the resource-spenders, not as a matter of external judgment? Can anything definite be said about the topic in general except “it is possible to do better or worse, and it is probably possible to do better than we do now”? If not, what standard of rationality do we have left to apply beyond toy models, to ourselves or our successors?