I actually think this is pretty wrong (posts forthcoming, but see here for the starting point). You make a separation between the modeled human values and the real human values, but “real human values” are a theoretical abstraction, not a basic part of the world. In other words, real human values were always a subset of modeled human values.
In the example of designing a transit system, there is an unusually straightforward division between things that actually make the transit system good (by concise human-free metrics like reliability or travel time), and things that make human evaluators wrongly think it’s good. But there’s not such a concise human-free way to write down general human values.
The pitfall of optimization here happens when the AI is searching for an output that has a specific effect on humans. If you can’t remove the fact that there is a model of humans involved, then avoiding that pitfall means the AI has to evaluate its output in some way other than by modeling the human’s reaction to it.
Awesome idea! I think there might be something here, but I think the difference between “no chance” and “0.01% chance” is more of a discrete change from not tracking something to tracking it. We might also expect neglect of “one in a million” vs “one in a trillion” in both updates and decision-making, which would cause a mistake opposite to the one predicted by this model in the case of decision-making.
About 95%. Because philosophy is easy* and full of obvious confusions.
(* After all, anyone can do it well enough that they can’t see their own mistakes. And with a little more effort, you can’t even see your mistakes when they’re pointed out to you. That’s, like, the definition of easy, right?)
95% isn’t all that high a confidence, if we put aside “how dare you rate yourself so highly?” type arguments for a bit. I wouldn’t trust a parachute that had a 95% chance of opening. Most of the remaining 5% is not dualism being true or us needing a new kind of science, it’s just me having misunderstood something important.
Anyhow, I agree that we have long since been rehashing standard arguments here :P
Seeing red is more than a role or disposition. That is what you have left out.
Suppose epiphenomenalism is true. We would still need two separate explanations—one explanation of your epiphenomenal activity in terms of made-up epiphenomenology, and a different explanation for how your physical body thinks it’s really seeing red and types up these arguments on LessWrong, despite having no access to your epiphenomena.
The mere existence of that second explanation makes it wrong to have absolute confidence in your own epiphenomenal access. After all, we’ve just described approximate agents that think they have epiphenomenal access, and type and make facial expressions and release hormones as if they do, without needing any epiphenomena at all.
We can imagine the approximate agent made out of atoms, and imagine just what sort of mistake it’s making when it says “no, really, I see red in a special nonphysical way that you have yet to explain” even when it doesn’t have access to the epiphenomena. And then we can endeavor not to make that mistake.
If I, the person typing these words, can Really See Redness in a way that is independent of, or additional to, a causal explanation of my thoughts and actions, my only honest course of action is to admit that I don’t know about it.
I’m supposing that we’re conceptualizing people using a model that has internal states. “Agency” of humans is shorthand for “conforms to some complicated psychological model.”
I agree that I do see red. That is to say, the collection of atoms that is my body enters a state that plays the same role in the real world as “seeing red” plays in the folk-psychological model of me. If seeing red makes the psychological model more likely to remember camping as a child, exposure to a red stimulus makes the atoms more likely to go into a state that corresponds to remembering camping.
“No, no,” you say. “That’s not what seeing red is—you’re still disagreeing with me. I don’t mean that my atoms are merely in a correspondence with some state in an approximate model that I use to think about humans, I mean that I am actually in some difficult to describe state that actually has parts like the parts of that model.”
“Yes,” I say, “you’re definitely in a state that corresponds to the model.”
“Arrgh, no! I mean when I see red, I really see it!”
“When I see red, I really see it too.”
It might at this point be good for me to reiterate my claim from the post, that rather than taking things in our notional world and asking “what is the true essence of this thing?”, it’s more philosophically productive to ask “what approximate model of the world has this thing as a basic object?”
Then the thought experiment is a useful negative result telling us we need something more comprehensive.
Paradigms also outline which negative results are merely noise :P I know it’s not nice to pick on people, but look at the negative utilitarians. They’re perfectly nice people, they just kept looking for The Answer until they found something they could see no refutation of, and look where that got them.
I’m not absolutely against thought experiments, but I think that high-energy philosophy as a research methodology is deeply flawed.
Suppose that we show how certain physical processes play the role of qualia within an abstract model of human behavior. “This pattern of neural activities means we should think of this person as seeing the color red,” for instance.
David Chalmers might then say that we have merely solved an “easy problem,” and that what’s missing is whether we can predict that this person—this actual first-person point of view—is actually seeing red.
This is close to what I parody as “Human physical bodies are only approximate agents, so how does this generate the real Platonic agent I know I am inside?”
When I think of myself as an abstract agent in the abstract state of “seeing red,” this is not proof that I am actually an abstract Platonic Agent in the abstract state of seeing red. The person in the parody has been misled by their model of themselves—they model themselves as a real Platonic agent, and so they believe that’s what they have to be.
Once we have described the behavior of the approximate agents that are humans, we don’t need to go on to describe the state of the actual agents hiding inside the humans.
Also replying to:
I am not clear how you are defining HEphil: do you mean (1) that any quest for the ontologically basic is HEphil, or (2) treating mental properties as physical is the only thing that is HEphil ?
Neither of those things is quite what I meant—sorry if I was unclear. The quest for the ontologically basic is what I call “thinking you’re like a particle physicist” (not inherently bad, but I claim that when it’s done to mental objects it’s pretty reliably bad). This is distinct from “high energy philosophy,” which I’m trying to use in a similar way to Scott.
High Energy Philosophy is the idea that extreme thought experiments help illuminate what we “really think” about things—that our ordinary low-energy thoughts are too cluttered and dull, but that we can draw out our intuitions with the right thought experiment.
I argue that this is a dangerous line of thought because it’s assuming that there exists some “what we really think” that we are uncovering. But what if we’re thinking using an approximation that doesn’t extend to all possible situations? Then asking what we really think about extreme situations is a wrong question.
[Even worse is when people ignore the fact that the concept is a human invention at all, and try to understand “the true nature of belief” (not just what we think about belief) by conceptual analysis.]
So, now, back to the question of “the correct ethical theory.” What, one might ask, is the correct ethical theory that captures what we really value in all possible physical situations (i.e. “extends to high energy”)?
Well, one can ask that, but maybe it doesn’t have an answer. Maybe, in fact, there is no such object as “what we really value in all possible physical situations”—it might be convenient to pretend there is in order to predict humans using a simple model, but we shouldn’t try to push that model too far.
(EDIT: Thanks for asking me these questions / pressing me on these points, by the way.)
Seems interesting, thanks!
I definitely think machine learning topics are useful. Given that there’s so much stuff out there and you can only cover a small fraction of it, maybe recent machine learning topics are a point of comparative advantage, even. The best textbook on set theory is probably pretty good already.
Another service that could take advantage of pre-existing textbooks is short summaries, designed to give people just enough of a taste to make an informed decision about reading said good textbook. Probably easier than developing a course on algorithmic information theory, or circuit complexity, or whatever.
I think the way to make sense of this (and e.g. surveys that ask this question) might be tautological: “It’s 0-10 on whatever opaque process I use to answer this question.”
This makes the absolute number nearly meaningless, though given human habits you can probably figure out approximate emotional valences of 0, 1-3, 4-5, 6-9, and 10. But depending on how stable the average person’s opaque mapping of emotional state to number is, it might still yield really interesting cross-time and cross-population comparisons.
In the rocket example, procedures A and B can both be optimized either by random sampling or by local search. A is optimizing some hand-coded rocket specifications, while B is optimizing a complicated human approval model.
The problem with A is that it relies on human hand-coding. If we put in the wrong specifications, and the output is extremely optimized, there are two possible cases: we recognize that this rocket wouldn’t work and we don’t approve it, or we think that it looks good but are probably wrong, and the rocket doesn’t work.
On the upside, if we’ve successfully hand-coded what a rocket should be, it will output working rockets.
The problem with B is that it’s simply the wrong thing to optimize if you want a working rocket. And because it’s modeling the environment and trying to find an output that makes the environment-model do something specific, you’ll get bad agent-like behavior.
Let’s go back to take a closer look at case A. Suppose you have the wrong rocket specifications, but they’re “pretty close” in some sense. Maybe the most spec-friendly rocket doesn’t function, but the top 0.01% of designs by the program are mostly in the top 1% of rockets ranked by your approval.
The programmed goal is proxy #1. Then you look through some of the top 0.01% of designs (sampled either randomly or through local search) for something you think will fly. Your approval is proxy #2. Your goal is the rocket working well.
What you’re really hoping for in designing this system is that even if proxy #1 and proxy #2 are both misaligned, their overlap or product is more aligned—more likely to produce an actual working rocket—than either alone.
This makes sense, especially under the model of proxies as “true value + noise,” but to the extent that model is violated maybe this doesn’t work out.
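To make that hope concrete, here’s a toy numerical sketch (mine, with made-up names, and with the “true value + independent noise” assumption baked in): score random designs by two noisy proxies, then compare optimizing either proxy alone against shortlisting by proxy #1 and letting proxy #2 pick from the shortlist.

```python
import random

# Toy model (my construction): each design has a hidden true quality; each proxy
# sees that quality plus its own independent noise.
random.seed(0)

def make_design():
    true_quality = random.gauss(0, 1)
    proxy1 = true_quality + random.gauss(0, 1)   # hand-coded spec score (proxy #1)
    proxy2 = true_quality + random.gauss(0, 1)   # modeled human approval (proxy #2)
    return true_quality, proxy1, proxy2

designs = [make_design() for _ in range(100_000)]

best_by_p1 = max(designs, key=lambda d: d[1])        # optimize the spec alone
best_by_p2 = max(designs, key=lambda d: d[2])        # optimize approval alone
shortlist = sorted(designs, key=lambda d: d[1], reverse=True)[:100]   # top ~0.1% by spec
best_combined = max(shortlist, key=lambda d: d[2])   # approval picks from the shortlist

print("true quality, proxy #1 only:", round(best_by_p1[0], 2))
print("true quality, proxy #2 only:", round(best_by_p2[0], 2))
print("true quality, #1 then #2:   ", round(best_combined[0], 2))
```

Under that noise model, the shortlist-then-filter procedure typically does better than either proxy alone; if the two proxies’ errors are strongly correlated, the advantage shrinks or vanishes.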
This is another way of seeing what’s wrong with case B. Case B just purely optimizes proxy #2, when the whole point of having human approval is to try to combine human approval with some different proxy to get better results.
As for local search vs. random sampling, this is a question about the landscape of your optimized proxy, and how this compares to the true value—neither way is going to be better literally 100% of the time.
If we imagine local optimization like water flowing downhill in the U.S., given a random starting point, the water is much more likely to end up at the mouth of the Mississippi river than it is to end up in Death Valley, even though Death Valley is below sea level. The Mississippi just has a broad network of similar states that lead into it via local optimization, whereas Death Valley is a “surprising” optimum. Under random sampling, you’re equally likely to find equal areas of the mouth of the Mississippi or Death Valley.
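Here’s a 1-D caricature of that picture (my own construction, nothing load-bearing): one broad valley and one narrower but deeper valley. Greedy local search from a random start mostly drains into the broad one, while taking the best of a huge random sample almost always lands in the narrow, “surprising” one, because it’s the global minimum.

```python
import random

# Toy 1-D landscape: a broad valley bottoming out at x = 30 (height 0) and a
# narrow valley at x = 80 that is actually deeper (height -0.5), like Death
# Valley sitting below sea level.
def height(x):
    broad = ((x - 30) / 40) ** 2
    narrow = ((x - 80) / 2) ** 2 - 0.5
    return min(broad, narrow)

def local_search(x, step=0.25, iters=2000):
    """Greedy downhill steps until no neighbor is lower."""
    for _ in range(iters):
        best = min((x - step, x, x + step), key=height)
        if best == x:
            break
        x = best
    return x

random.seed(0)
starts = [random.uniform(0, 100) for _ in range(2000)]
narrow_hits = sum(abs(local_search(x) - 80) < 5 for x in starts)
print(f"local search ends in the narrow (deeper) valley: {narrow_hits / 20:.0f}% of starts")

# Brute random sampling instead: the single best of many samples is almost
# always in the narrow valley, because that's where the global minimum lives.
best_sample = min((random.uniform(0, 100) for _ in range(20_000)), key=height)
print(f"best-of-20000 random samples lands at x = {best_sample:.1f}")
```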
Applying this to rockets, I would actually expect local search to produce much safer results in case B. Working rockets probably have broad basins of similar almost-working rockets that feed into them in configuration-space, whereas the rocket that spells out a message to the experimenter is quite a bit more fragile to perturbations.
(Even if rockets are so complicated and finicky that we expect almost-working rockets to be rarer than convincing messages to the experimenter, we still might think that the gradient landscape makes gradient descent relatively better.)
In case A, I would expect much less difference between locally optimizing proxy #1 and sampling until it was satisfied. The difference for human approval came because we specifically didn’t want to find the unstable, surprising maxima of human approval. And maybe the same is true of our hand-coded rocket specifications, but I would expect this to be less important.
It was using micro effectively, but the crazy 1200+ APM fight was pretty unusual. If you look at most of its fights (e.g. https://youtu.be/H3MCb4W7-kM?t=2854 ; the APM number appears intermittently at the center-bottom), with 6-10 units, it’s using about the same APM as the human. The micro advantage for 98% of the game isn’t that it’s clicking faster; its clicks are just better.
There were a bunch of mistakes in the first matches shown, but when they trained for twice as long it seemed like those mistakes mostly went away, and its macro play seemed within the range of skilled humans (if you’re willing to suspect that overbuilding probes might be good).
Well, it definitely may have had an advantage that embodied humans can’t have. “Does perfect stalker micro really count as intelligence?”, we wail. But you have to remember that previous StarCraft bots playing with completely unrestricted APM weren’t even close to competitive level. I think the evidence is pretty strong that AlphaStar (at least the version without attention, which just perceived the whole map) could beat humans under whatever symmetric APM cap you want.
I want to like the AI alignment podcast, but I feel like Lucas is over-emphasizing thinking new thoughts and asking challenging questions. When this goes wrong (it felt like ~30% of this episode, but maybe this is a misrecollection), it ends up taking too much effort to understand for not enough payoff.
There are sometimes some big words and abstruse concepts dragged in, just to talk about some off-the-cuff question and then soon jump to another. But I don’t think that’s what I want as a member of the audience. I’d prefer a gentler approach that used smaller words and broke down the abstruse concepts more, and generated interest by delving into interesting parts of “obvious” questions, rather than jumping to new questions.
In short, I think I’d prefer a more curiosity-based interviewing style—I like it most when he’s asking the guests what they think, and why they think that, and what they think is important. I don’t know if you (dear reader) have checked out Sean Carroll’s podcast, but his style is sort of an extreme of this.
Any time you have a search process (and, let’s be real, most of the things we think of as “smart” are search problems), you are setting a target but not specifying how to get there. I think the important sense of the word “agent” in this context is that it’s a process that searches for an output based on the modeled consequences of that output.
For example, if you want to colonize the upper atmosphere of Venus, one approach is to make an AI that evaluates outputs (e.g. text outputs of persuasive arguments and technical proposals) based on some combined metric of how much Venus gets colonized and how much it costs. Because it evaluates outputs based on their consequences, it’s going to act like an agent that wants to pursue its utility function at the expense of everything else.
Call the above output “the plan”—you can make a “tool AI” that still outputs the plan without being an agent!
Just make it so that the plan is merely part of the output—the rest is composed according to some subprogram that humans have designed for elucidating the reasons the AI chose that output (call this the “explanation”). The AI predicts the results as if its output was only the plan, but what humans see is both the plan and the explanation, so it’s no longer fulfilling the criterion for agency above.
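A minimal sketch of that construction (everything here is hypothetical, and the world model is reduced to a toy scoring function): the search loop only ever evaluates the modeled consequences of the plan itself, and the explanation is bolted on afterwards by a fixed, human-written subroutine.

```python
from dataclasses import dataclass

@dataclass
class Output:
    plan: str
    explanation: str

def modeled_consequences_score(plan: str) -> float:
    # Toy stand-in for the AI's world model: in the Venus example this would be
    # "how colonized does Venus end up, at what cost, if we output just this plan?"
    return -abs(len(plan) - 40)

def explain(plan: str, search_trace: list) -> str:
    # Human-designed subprogram that elucidates why this plan was chosen.
    # Crucially, nothing in the search loop ever evaluates this string.
    ranked = sorted(search_trace, key=lambda t: t[1], reverse=True)
    runner_up = ranked[1][1] if len(ranked) > 1 else None
    return f"Picked the top scorer of {len(ranked)} candidates (runner-up score: {runner_up})."

def tool_ai(candidate_plans: list) -> Output:
    search_trace = []
    best_plan, best_score = None, float("-inf")
    for plan in candidate_plans:
        score = modeled_consequences_score(plan)   # consequences of the plan alone
        search_trace.append((plan, score))
        if score > best_score:
            best_plan, best_score = plan, score
    # The explanation is appended after the search, so the plan was not chosen
    # in order to make the explanation look persuasive.
    return Output(plan=best_plan, explanation=explain(best_plan, search_trace))
```

The whole weight of the construction rests on the scoring call seeing only the plan; if the world model used for scoring includes the explanation and the humans reading it, the distinction collapses.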
In this example, the plan is a bad idea in both cases—the thing you programmed the AI to search for is probably something that’s bad for humanity when taken to an extreme. It’s just that in the “tool AI” case, you’ve added some extra non-search-optimized output that you hope undoes some of the work of the search process.
Making your search process into a tool by adding the reason-elucidator hopefully made it less disastrously bad, but it didn’t actually get you a good plan. The problems that you need to solve to get a superhumanly good plan are in fact the same problems you’d need to solve to make the agent safe.
(Sidenote: This can be worked around by giving your tool AI a simplified model of the world and then relying on humans to un-simplify the resulting plan, much like Google Maps makes a plan in an extremely simplified model of the world and then you follow something that sort of looks like that plan. This workaround fails when the task of un-simplifying the plan becomes superhumanly difficult, i.e. right around when things get really interesting, which is why imagining a Google-Maps-like list of safe abstract instructions might be building a false intuition.)
In short, to actually find out the superintelligently awesome plan to solve a problem, you have to have a search process that’s looking for the plan you want. Since this sounds a lot like an agent, and an unfriendly agent is one of the cases we’re most concerned about, it’s easy and common to frame this in terms of an agent.
I’m not 100% sold on explaining actions as a solution here. It seems like the basic sorts of “attack” (exploiting human biases or limitations, sending an unintended message to the supervisor, sneaking a message to a 3rd party that will help control the reward signal) still work fine—so long as the search process includes the explainer as part of the environment. And if it doesn’t, we run into the usual issue with such schemes: the AI predictably gets its predictions wrong, and so you need some guarantee that you can keep this AI and its descendants in this unnatural state.
I’ll probably end up mostly agreeing with Integrated Information theories
Ah… x.x Maybe check out Scott Aaronson’s blog posts on the topic (here and here)? I’m definitely more of the Dennettian “consciousness is a convenient name for a particular sort of process built out of lots of parts with mental functions” school.
Anyhow, the reason I focused on drawing boundaries to separate my brain into separate physical systems is mostly historical—I got the idea from the Ebborians (further rambling here. Oh, right—I’m Manfred). I just don’t find mere mass all that convincing as a reason to think that some physical system’s surroundings are what I’m more likely to see next.
Intuitively it’s something like a symmetry of my information—if I can’t tell anything about my own brain mass just by thinking, then I shouldn’t assign my probabilities as if I have information about my brain mass. If there are two copies of me, one on Monday with a big brain and one on Tuesday with a small brain, I don’t see much difference in sensibleness between “it should be Monday because big brains are more likely” and “I should have a small brain because Tuesday is an inherently more likely day.” It just doesn’t compute as a valid argument for me without some intermediate steps that look like the Ebborians argument.
It’s about trying to figure out what’s implied about your brain by knowing that you exist.
It’s also about trying to draw some kind of boundary with “unknown environment to interact with and reason about” on one side and “physical system that is thinking and feeling” on the other side. (Well, only sort of.)
Treating a merely larger brain as more anthropically important is equivalent to saying that you can draw this boundary inside the brain (e.g. dividing big neurons down the middle), so that part of the brain is the “reasoner” and the rest of the brain, along with the outside, is the environment to be reasoned about.
This boundary can be drawn, but I think it doesn’t match my self-knowledge as well as drawing the boundary based on my conception of my inputs and outputs.
My inputs are sight, hearing, proprioception, etc. My outputs are motor control, hormone secretion, etc. The world is the stuff that affects my inputs and is affected by my outputs, and I am the thing doing the thinking in between.
If I tried to define “I” as the left half of all the neurons in my head, suddenly I would be deeply causally connected to this thing (the right halves of the neurons) I have defined as not-me. These causal connections are like a huge new input and output channel for this defined-self—a way for me to be influenced by not-me, and influence it in turn. But I don’t notice this or include it in my reasoning—Paper and Scissors in the story are so ignorant about it that they can’t even tell which of them has it!
So I claim that I (and they) are really thinking of ourselves as the system that doesn’t have such an interface, and just has the usual suite of senses. This more or less pins down the thing doing my thinking as the usual lump of non-divided neurons, regardless of its size.