Maybe optimality relative to the best performer out of some class of algorithms that doesn’t include “just pick the absolute best answer”? You basically prove that in environments with traps, anything that would, absent traps, be guaranteed to find the absolute best answer will instead get trapped. So those aren’t actually very good performers.
I just can’t come up with anything too clever, though, because the obvious classes of algorithms, like “polynomial time,” include the ability to just pick the absolute best answer by luck.
The former (that is, model-based RL → agent). The latter (smart agent → model-based RL), I think, would be founded on a bit of a level error. At bottom, there are only atoms and the void. Whether something is “really” an agent is a question of how well we can describe this collection of atoms in terms of an agent-shaped model. This is different from the question of what abstractions humans used in the process of programming the AI; like Rohin says, parts of the agent might be thought of as implicit in the programming, rather than explicit.
Sorry, I don’t know if I can direct you to any explicit sources. If you check out papers like Concrete Problems in AI Safety or others in that genre, though, you’ll see model-based RL used as a simplifying set of assumptions that imply agency.
It seems like the upshot is that even weak optimality is too strong, since it has to try everything once. How does one make even weaker guarantees of good behavior that are useful in proving things, without just defaulting to expected utility maximization?
Yup, I’m pretty sure people are aware of this :) See also the model of an agent as something with preferences, beliefs, available actions, and a search+decision algorithm that makes it take actions it believes will serve its preferences.
But future AI research will require some serious generalizations that are left un-generalized in current methods. A simple gridworld problem might treat the entire grid as a known POMDP and do search over possible series of actions. Obviously the real world isn’t a known POMDP, so suppose that we just call it an unknown POMDP and try to learn it through observation—now all of a sudden, you can’t hand-specify a cost function in terms of the world model anymore, so that needs to be re-evaluated as well.
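To make that concrete, here’s a minimal sketch (mine, not from any particular paper) of the “known world model + hand-specified cost + search over action sequences” setup, simplified to a fully observed deterministic grid. Every name in it (GridWorld, plan, etc.) is a placeholder rather than a real library:

```python
from itertools import product

class GridWorld:
    """A fully known toy world: states are (x, y) cells, actions move the agent."""
    def __init__(self, width, height, goal):
        self.width, self.height, self.goal = width, height, goal
        self.actions = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # the agent's available actions

    def step(self, state, action):
        x, y = state
        dx, dy = action
        nx = min(max(x + dx, 0), self.width - 1)
        ny = min(max(y + dy, 0), self.height - 1)
        return (nx, ny)

    def reward(self, state):
        # Preferences hand-specified in terms of the known world model --
        # exactly the thing that stops being possible once the model is learned.
        return 1.0 if state == self.goal else 0.0

def plan(world, start, horizon=6):
    """Brute-force search over action sequences: the agent's decision algorithm."""
    best_value, best_seq = float("-inf"), None
    for seq in product(world.actions, repeat=horizon):
        state, total = start, 0.0
        for a in seq:
            state = world.step(state, a)
            total += world.reward(state)
        if total > best_value:
            best_value, best_seq = total, seq
    return best_seq

world = GridWorld(4, 4, goal=(3, 3))
print(plan(world, start=(0, 0)))
```

The point is just that the reward function is written directly against the known state (x, y); once the model has to be learned from observation, there’s no such hand-specifiable hook anymore.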
Obviously I have much less information about your situation than you, but it seems to me that you’re not in the right here, and you should be less adversarial.
Yesterday, a teacher was explaining the composition of a literary essay, and she claimed that an essay writer isn’t required to provide justification for their claims. I asked, “Then why should I believe anything the essay says?” and she replied, “You’re free to decide whether you believe it or not,” and I was just too exhausted from last week to explain that that’s not how beliefs should work.
But an essay writer isn’t required to provide justification for their claims. For example, your first sentence is a claim that you have started studying creative writing full time. Have you justified this to me (either in some unattainable absolute sense, or even just beyond reasonable doubt)? No. Should you? Also no.
When you run into an absurd claim like “An essay writer isn’t required to provide justifications for their claims,” you should think seriously about how it might be true. I think you’re only going to be satisfied by understanding communication on a more detailed level than your professors do, but you should do that, not just reject what they say.
Back to the object level: Why should I believe you when you claim that you’ve started studying creative writing? I do believe you, of course—but practically speaking, why do I do that? Try to figure out an answer that generalizes by being based on the practicalities of how humans communicate and infer things about the world. And then apply that answer back to what sort of evidence an essay writer needs to provide to their audience to do their job well.
I also think you’re trying to use arguments in ways that won’t work. Robert Nozick makes some clever comments about arguments in the best part of his book Philosophical Explanations (the introduction), something like: The goal of most philosophers seems to be to find arguments so compelling that, if a person were to disagree with the conclusion after reading the argument, their head would explode.
The terminology of philosophical art is coercive: arguments are powerful and best when they are knockdown, arguments force you to a conclusion, if you believe the premises you have to or must believe the conclusion, some arguments do not carry much punch, and so forth. A philosophical argument is an attempt to get someone to believe something, whether he wants to believe it or not. A successful philosophical argument, a strong argument, forces someone to a belief.
Though philosophy is carried on as a coercive activity, the penalty philosophers wield is, after all, rather weak. If the other person is willing to bear the label of “irrational” or “having the worse arguments,” he can skip away happily maintaining his previous belief. He will be trailed, of course, by the philosopher furiously hurling philosophical imprecations: “What do you mean, you’re willing to be irrational? You shouldn’t be irrational because...” And although the philosopher is embarrassed by his inability to complete this sentence in a noncircular fashion—he can only produce reasons for accepting reasons—still, he is unwilling to let his adversary go.
Wouldn’t it be better if philosophical arguments left the person no possible answer at all, reducing him to impotent silence? Even then, he might sit there silently, smiling, Buddhalike. Perhaps philosophers need arguments so powerful they set up reverberations in the brain: if the person refuses to accept the conclusion, he dies. How’s that for a powerful argument. Yet, as with other physical threats (“your money or your life”), he can choose defiance. A “perfect” philosophical argument would leave no choice.
But the point of that chapter is that such arguments don’t exist. They’re an oversimplification of how arguments work. A fiction. Much like an essay, an argument with a professor is an exercise in communication, not in structuring a coercive argument.
So if nothing else, I think taking a more learning-oriented approach towards your professors might make you more likely to be able to convince them of things :)
Reflective modification flow: Suppose we have an EDT agent that can take an action to modify its decision theory. It will choose based on the average outcome conditional on each possible decision. In some circumstances EDT agents do well, so it will expect to do well by not changing; in other circumstances, maybe it expects to do better conditional on self-modifying to use the Counterfactual Perspective more.
Evolutionary flow: If you put a mixture of EDT and FDT agents in an evolutionary competition where they’re playing some iterated game and high scorers get to reproduce, what does the population look like at large times, for different games and starting populations?
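The kind of experiment I have in mind for the evolutionary flow can be sketched as discrete replicator dynamics. The payoff matrix below is invented purely for illustration; in a real version it would come out of actually simulating the iterated game each pairing plays (e.g. some twin prisoner’s dilemma variant):

```python
import numpy as np

# payoff[i][j] = average score a type-i agent gets when matched against a type-j agent.
# Rows/columns: 0 = EDT, 1 = FDT. These numbers are made up for illustration only.
payoff = np.array([
    [2.0, 1.0],
    [3.0, 2.5],
])

def evolve(initial_fdt_fraction, generations=100):
    p = np.array([1.0 - initial_fdt_fraction, initial_fdt_fraction])  # [EDT, FDT] shares
    for _ in range(generations):
        fitness = payoff @ p                 # expected score of each type this generation
        p = p * fitness / (p @ fitness)      # reproduce in proportion to score
    return p

print(evolve(initial_fdt_fraction=0.1))  # long-run population mix
```

With those made-up numbers FDT takes over from any nonzero starting share; the interesting question is which games and starting populations flip that.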
It seems like if I’m trying to talk about a real-world case with finite support, I’ll say something like “it’s not actually a power law—but it’s well described by one over the relevant range of values.” Meaning that I have some notion of “relevant” which is probably derived from action-relevance, or relevance to my observations, or maybe computational complexity.
If I can’t say that, then the other main option is that I care more and more as the power law gets more extreme, and then as the possibilities reach their physical limit I care most of all. But cases like this are so idiosyncratic that maybe there’s no point in trying to develop a unified language for them.
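To gesture at what I mean operationally by “well described by one over the relevant range,” here’s a rough sketch: the data, the cutoff, and the choice of range are all stand-ins, and the “fit” is just a line in log-log space:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: heavy-tailed, but with a hard physical cutoff (finite support).
samples = np.minimum(rng.pareto(1.5, size=100_000) + 1.0, 1e4)

# The "relevant" range is chosen by action-relevance, not read off the data itself.
lo, hi = 10.0, 1_000.0
xs = np.sort(samples[(samples >= lo) & (samples <= hi)])

# Empirical complementary CDF on the restricted range; a straight line in
# log-log space means the restricted data looks power-law-ish there.
ccdf = 1.0 - np.arange(len(xs)) / len(xs)
slope, intercept = np.polyfit(np.log(xs), np.log(ccdf), 1)
print(f"fitted tail exponent over the relevant range: {-slope:.2f}")
```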
I usefully demonstrated rationality superpowers yesterday by bringing a power strip to a group project with limited power outlets.
Now, you could try to grind this ability by playing improv games with the situations around you, looking for affordances, needs, and solutions. But this is only a sub-skill, and I think most of my utility comes from things that are more like mindset technology.
A personal analogy: If I want to learn the notes of a tune on the flute, it works fine to just play it repeatedly—highly grindable. If I want to make that tune sound better, this is harder to grind but still doable; it involves more skillful listening to others, listening to yourself, talking, trial and error. If I want to improve the skills I use to make tunes sound better, I can make lots of tunes sound better, but less of my skill is coming from grinding now, because that accumulation is slower than other methods of learning. And if I want to improve my ability to learn the skills used in making tunes sound better...
Well, first off, that’s a rationality-adjacent skill, innit? But second, grinding that is so slow, and so stochastic, that it’s hard to distinguish from just living my life while happening to try to learn things, and accepting that I might learn one-time things that obviate a lot of grinding.
So maybe the real grinding was bringing the power strip all along.
How much are you thinking about stability under optimization? Most objective catastrophes are also human catastrophes. But if a powerful agent is trying to achieve some goal while avoiding objective catastrophes, it seems like it’s still incentivized to dethrone humans—to cause basically the most human-catastrophic thing that’s not objective-catastrophic.
I’m definitely satisfied with this kind of content.
The names suggest you’re classifying decision procedures by what kind of thoughts they have in special cases. But, “sneakily,” the point is that this is relevant because these are the kinds of thoughts they have all the time.
I think the next place to go is to put this in the context of methods of choosing decision theories—the big ones being reflective modification and evolutionary/population-level change. Pretty generally it seems like the trivial perspective is unstable under these, but there are some circumstances where it’s not.
Thank you for putting all the time and thoughtfulness into this post, even if the conclusion is “nope, doesn’t pan out.” I’m grateful that it’s out here.
I think it’s mostly (3). Not because AI safety is an outlier, but because of how much work people had to do to come to grips with Moravec’s paradox.
If you take someone clever and throw them at the problem of GAI, the first thing they’ll think of is something doing logical reasoning, able to follow natural language commands. Their intuition will be based on giving orders to a human. It takes a lot of work to supplant that intuition with something more mechanistic.
Like, it seems obvious to us now that building something that takes natural language commands and actually does what we mean is a very hard problem. But this is exactly a Moravec’s paradox situation, because knowing what people mean is mostly effortless and unconscious to us.
Hey, thanks for this congenial reply to my fairly rude comment :)
So, I bring up the military thing because of a roommate of mine, but if I google “military posture tips,” I get this page, which basically says that if you’re hunched forward, you need to stretch the muscles causing that force, and exercise the muscles that naturally oppose it. In short, get a stronger upper and lower back! They also give specific recommendations (albeit mostly geared towards body-weight exercises easy for a home reader to do).
I really love the level of detail in this sketch!
I’m mentally substituting continue_t for some question more like “should this debate continue?”, because I think the setup you describe keeps going until Amp is satisfied with an answer, which might be never for weak M. It’s also not obvious to me that the reward system you describe actually teaches agents to debate between odd and even steps. If there’s a right answer that the judge might be convinced of, I think M will be trained to give it no matter the step parity, because when that happens it gets rewarded.
Really, it feels like the state of the debate is more like the state of an RNN, and you’re going to end up training something that can make use of that state to do a good job of ending debates and making the human response similar to the model response.
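To say a bit more about the RNN analogy, here’s a rough sketch (my framing, not the post’s) of a recurrent core that carries the debate state and, from that state, proposes an answer and a “should this debate continue?” signal. Every dimension and name here is a placeholder:

```python
import torch
import torch.nn as nn

class DebateCore(nn.Module):
    def __init__(self, turn_dim=64, hidden_dim=128, answer_dim=32):
        super().__init__()
        self.cell = nn.GRUCell(turn_dim, hidden_dim)     # hidden state = debate state
        self.answer_head = nn.Linear(hidden_dim, answer_dim)
        self.continue_head = nn.Linear(hidden_dim, 1)    # "should this debate continue?"

    def forward(self, turns):
        # turns: (num_turns, batch, turn_dim) -- embeddings of each debate statement
        h = torch.zeros(turns.shape[1], self.cell.hidden_size)
        for turn in turns:
            h = self.cell(turn, h)
        answer = self.answer_head(h)
        p_continue = torch.sigmoid(self.continue_head(h))
        return answer, p_continue

core = DebateCore()
answer, p_continue = core(torch.randn(5, 2, 64))  # 5 turns, batch of 2 debates
```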
Thanks! This is an interesting recommendation.
I was definitely struck by the resemblance between her notion of “normative dependence” and the ideas behind the CIRL framework. And I think that the fix for the AI reasoning about something more intelligent is more or less the same thing humans do, which is to abstract away the planning and replace it with some “power” to do something. Like, if I imagine playing Magnus Carlsen in chess, I don’t simulate a chess game at all; I compare the imaginary chess-winning powers I attribute to each of us in my abstracted mental representation.
But as for the philosophical problems she mentions in the interview, I felt like they fell into pretty standard orthodox philosophical failure modes. For the sake of clarity, I guess I should say I mean the obsession with history, and the default assumption that questions have one right answer—questions I think are boondoggles get addressed just because they’re historical, and there’s too much worry about what humans “really” are like, as opposed to consideration of models of humans.
You have an entire copy of the post in the commenting guidelines, fyi :)
What’s often going on in unresolvable debates among humans is that there is a vague definition baked into the question, such that there is no “really” right answer (or too many right answers).
E.g. “Are viruses alive?”
To the extent that we’ve dealt with the question of whether viruses are alive, it’s been by understanding the complications and letting go of the need for the categorical thinking that generated the question in the first place. Allowing this as an option seems like it brings back down the complexity class of things you can resolve debates on (though if you count “it’s a tie” as a resolution, you might retain the ability to ask questions in PSPACE but just have lots of uninformative ties and only update your own worldview when it’s super easy).
For questions of value, though, this approach might not even always work, because the question might be “is it right to take action A or action B,” and even if you step back from the category “right” because it’s too vague, you still have to choose between action A or B. But you still have the original issue that the question has too few / too many right answers. Any thoughts on ways to make debate do work on this sort of tricky problem?
Weird question, why bother mentally mapping two different ends of the same bone (sternum)? In fact, why all this trivia about bone knobs in the first place? If I want good posture, I’d be better off learning the lessons of the military, and if I want to relax, why bone knobs?
It’s fine, people are only 1 layer of unreality removed from money, so they can interact via gravity, which “leaks” into the 4th dimension (explaining why it’s so much weaker than the electromagnetic force).
When you say the human decision procedure causes human values, what I hear is that the human decision procedure (and its surrounding way of describing the world) is more ontologically basic than human values (and their surrounding way of describing the world).
Our decision procedure is “the reason for our values” in the same way that the motion of electric charge in your computer is the reason it plays videogames (even though “the electric charge is moving” and “it’s playing a game” might be describing the same physical event). The arrow between them isn’t the most typical causal arrow between two peers in a singular way of describing the world, it’s an arrow of reduction/emergence, between things at different levels of abstraction.