I wasn’t that active around the time of the sequences, but I had a good number of discussions with people, and the point “the AI will of course know what your values are, it just won’t care” was made many times; I am also pretty sure it was made in the sequences (I would have to dig it up, and am on my phone, but I heard that sentence in spoken conversation a lot over the years).
I don’t think “empowerment” is the kind of concept that particularly survives heavy optimization pressure, though it seems worth investigating.
Around the time of the sequences (long before DL), it was much less obvious that AI could or would learn accurate models of complex human values before it killed us.
the point “the AI will of course know what your values are, it just won’t care” was made many times, and I am also pretty sure it was made in the sequences
Notice I said “before it killed us”. Sure, the AI may learn detailed models of humans and human values at some point during its superintelligent FOOMing, but that’s irrelevant, because we need to instill its utility function long before that. See my reply here; this is well documented, and no amount of vaguely remembered conversation trumps the written evidence.
I don’t think “empowerment” is the kind of concept that particularly survives heavy optimization pressure, though it seems worth investigating.
I’m not entirely sure what people mean when they say “X won’t survive heavy optimization pressure”; for example, the objective of modern diffusion models survives heavy optimization pressure.
External empowerment is very simple, and it doesn’t even require detailed modeling of the agent: the agent can just be a black box that produces outputs. I’m curious what you think is an example of “the kind of concept that particularly survives heavy optimization pressure”.
Oh—empowerment is about as immune to Goodharting as you can get, and that’s perhaps one of its major advantages[1]. However, in practice one has to use some approximation, which may or may not be Goodhartable to some degree, depending on many details.
Empowerment is vastly more difficult to Goodhart than a corporation optimizing for some bundle of currencies (including crypto), much more difficult to Goodhart than optimizing for control over even more fundamental physical resources like mass and energy, and is generally the least-Goodhartable objective that could exist. In some sense the universal version of Goodharting, properly defined, is just a measure of deviation from empowerment. Empowerment is the core driver of human intelligence, and for good reason.
Can you explain further? This seems to me like a very large claim that, if true, would have a big impact, but I’m not sure how you arrived at the immunity-to-Goodhart result here.
[1] This applies to Regressional, Causal, Extremal, and Adversarial Goodhart.
Empowerment could be defined as the natural unique solution to Goodharting. Goodharting is the divergence, under optimization scaling, between the trajectories produced by a utility function and those produced by some proxy of that utility function.
However, due to instrumental convergence, the trajectories of all reasonable agent utility functions converge under optimization scaling, and empowerment is simply that which they converge to.
In other words, the empowerment of some agent, P(X), is the utility function that minimizes trajectory distance to any reasonable agent utility function U(X), regardless of its specific (potentially unknown) form.
Therefore, empowerment is, by definition, the best possible proxy utility function (under optimization scaling).
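One way to write this compactly (my own notation, a sketch rather than anything from the thread): let $\tau^*_U(s)$ be the trajectory an optimal $U$-maximizer follows from state $s$, let $d$ be some divergence between trajectories, and let $\mathcal{U}$ be the class of reasonable agent utility functions. Then the Goodhart error of a proxy $V$ for a true utility $U$ is $d(\tau^*_V(s), \tau^*_U(s))$ as optimization strength scales, and the claim is that empowerment $P$ is the proxy minimizing that error in the worst case over $\mathcal{U}$:

$$P = \operatorname*{arg\,min}_{V} \; \max_{U \in \mathcal{U}} \; d\big(\tau^*_V(s),\, \tau^*_U(s)\big)$$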
Let’s work through some quick examples:
Under scaling, an AI with some crude Hibbard-style happiness approximation will first empower itself and then eventually tile the universe with smiling faces (according to EY), or perhaps more realistically—with humans bio-engineered for docility, stupidity, and maximum bliss. Happiness alone is not the true human utility function.
Under scaling, an AI with some crude stock-value maximizing utility function will first empower itself and then eventually cause hyperinflation of the reference currencies defining the stock price. Stock value is not the true utility function of the corporation.
Under scaling, an AI with a human empowerment utility function will first empower itself, and then empower humanity—maximizing our future optionality and our ability to fulfill any unknown goals/values, while ensuring our survival (because death is the minimally empowered state). This works because empowerment is pretty close to the true utility function of intelligent agents due to convergence, or is at least the closest universal proxy. If you strip away a human’s drives for sex, food, child-rearing, and simple pleasures, most of what remains is empowerment-related (manifesting as curiosity, drive, self-actualization, fun, self-preservation, etc.).
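The divergence-under-scaling framing in these examples can be shown numerically. Below is a toy sketch (mine, not from the thread) of the Regressional flavor: the proxy is true utility plus independent error, the optimizer picks the best-by-proxy candidate out of n, and the gap between proxy score and true utility grows as the optimization pressure n grows.

```python
import random

def goodhart_gap(n, trials=300, seed=0):
    """Average (proxy - true) utility of the proxy-argmax over n candidates."""
    rng = random.Random(seed)
    gap = 0.0
    for _ in range(trials):
        # Each candidate has a true utility and an independent proxy error.
        cands = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
        true, err = max(cands, key=lambda c: c[0] + c[1])  # argmax of proxy
        gap += err  # proxy score minus true utility is exactly the error term
    return gap / trials

for n in (10, 100, 1000):
    print(f"candidates={n:>5}  avg proxy-true gap={goodhart_gap(n):.2f}")
```

Harder optimization keeps raising the proxy score, but an increasing share of that score is error rather than true utility: the two trajectories diverge, which is the Goodharting defined above.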
An AI with a good world model will predictably have a model of your values, but that’s different from being able to actually elicit that model via e.g. a series of labeled examples. That’s the part that seemed less plausible before DL.
Basically, “not surviving heavy optimization pressure” is Goodhart’s law in action: optimizing a proxy more and more destroys what you actually value.