Agreed (haven’t thought about that).
The distinction between the mesa- and behavioral objectives might be very useful when reasoning about deceptive alignment (in which the mesa-optimizer tries to have a behavioral objective that is similar to the base objective, as an instrumental goal for maximizing the mesa-objective).
My claim is more that “just heuristics” is enough for arbitrary levels of performance (even if you could improve on that by adding hardcoded optimization).
This claim seems incorrect for at least some tasks (if you already think that, skip the rest of this comment).
Consider the following 2-player turn-based zero-sum game as an example of a task in which “heuristics” seemingly can’t replace tree search.
The game starts with an empty string. In each turn the following things happen:
(1) the player adds to the end of the string either “A” or “B”.
(2) the string is replaced with its SHA256 hash.
Player 1 wins iff after 10 turns the first bit in the binary representation of the string is 1.
(Alternatively, consider the 1-player version of this game, starting with a random string.)
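To make this concrete, here is a minimal sketch of the game together with an exhaustive minimax search over all 2^10 move sequences (a hypothetical illustration; I’m assuming hex digests, that player 1 moves first, and that “first bit” means the most significant bit of the final hash):

```python
# Sketch of the hashing game plus a brute-force minimax search.
import hashlib

def step(s: str, move: str) -> str:
    """Append "A" or "B" and replace the string with its SHA256 hash."""
    return hashlib.sha256((s + move).encode()).hexdigest()

def player1_wins(s: str = "", turn: int = 0) -> bool:
    """True iff player 1 can force a win from this position."""
    if turn == 10:
        # The first bit is 1 iff the first hex digit of the digest is >= 8.
        return int(s[0], 16) >= 8
    outcomes = [player1_wins(step(s, m), turn + 1) for m in "AB"]
    # Player 1 moves on turns 0, 2, 4, ...; player 2 tries to prevent the win.
    return any(outcomes) if turn % 2 == 0 else all(outcomes)

print("Player 1 can force a win:", player1_wins())
```

The search solves the game exactly with roughly 2^11 hash evaluations, whereas a “heuristic” that evaluates positions without search would effectively have to predict bits of SHA256 outputs, which we expect to be infeasible.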
My question is, who is thinking directly about how to achieve such coordination (aside from FHI’s Center for the Governance of AI, which I’m aware of) and where are they talking about it?
OpenAI has a policy team (this 80,000 Hours podcast episode is an interview with three people from that team), and I think their research areas include models for coordination between top AI labs, and improving publication norms in AI (e.g. maybe striving for norms that are more like those in computer security, where people are expected to follow some responsible disclosure process when publishing about new vulnerabilities). For example, the way OpenAI is releasing their new language model GPT-2 seems like a useful way to learn about the usefulness/feasibility of new publication norms in AI (see the “Release Strategy” section here).
I think related work is also being done at the Centre for the Study of Existential Risk (CSER).
The claim “I think this coin is heads with probability 50%” is an expression of my own ignorance, and 50% probability means that I’d bet at 1 : 1 odds (or better) that the coin came up heads.
Just a minor quibble—using this interpretation to define one’s subjective probabilities is problematic because people are not necessarily indifferent about placing a bet that has an expected value of 0 (e.g. due to loss aversion).
Therefore, I think the following interpretation is more useful: Suppose I win [some reward] if the coin comes up heads. I’d prefer to replace the winning condition with “the ball in a roulette wheel ends up in a red slot” for any roulette wheel in which more than 50% of the slots are red.
(I think I first came across this type of definition in this post by Andrew Critch)
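Stated as a formula (my own formalization of the above, not wording from Critch’s post):

$$P(\text{heads}) = \inf\{\, r \in [0,1] : \text{I prefer “the ball ends up in a red slot” on a wheel with red fraction } r \,\}$$

so “50% probability” just means I’d accept the replacement for any wheel in which more than half of the slots are red.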
Thank you for clarifying!
FWIW, when I wrote “the exact same problem but with different labels” I meant “the exact same problem but with different arbitrary names for entities”.
For example, I would consider the following two problems to be “the exact same problem but with different labels”:
“X+1=2 therefore X=”
“Y+1=2 therefore Y=”
But NOT the following two problems:
“X+1=2 therefore X=”
“X+2=3 therefore X=”
And this sounds like goal post moving:
I’m failing to see how the goalposts moved between me writing:
It’s a cool language model but can it do even modest logic-related stuff without similar examples in the training data?
and then later writing (in reply to your comment quoting that sentence):
unless a very similar problem appears in the training data—e.g. the exact same problem but with different labels
If I’m missing something I’d be grateful for a further explanation.
However, Twitter has become worse over time, possibly because it has learned to show me non-academic stuff that is more attention-grabbing or controversial, despite me trying not to click on those sorts of things.
On Twitter you can create a list of relevant people (e.g. people who tend to tweet about relevant papers/posts) and then go over the complete “feed” of just that list, sorted chronologically.
Explicitly saying that you’d like feedback on the newsletter (like you just did in this post) would probably help and as digital_carver suggested you can include a request for feedback in each newsletter. For example, the Import AI newsletter ends with “If you have suggestions, comments or other thoughts you can reach me at … or tweet at me...”
The newsletter is extremely helpful to me for keeping up to date with AI alignment research. I also find the “Other progress in AI” section very helpful.
Both the summaries and the opinion segments are extremely helpful for me!
Overall, I think that reading (or listening to) all the ANs that I’ve read so far was an extremely high EV-per-hour time investment.
Here’s a relevant passage by Rohin (from Alignment Newsletter #49, March 2019):
On the topic of impact measures, I’ll repeat what I’ve said before: I think that it’s hard to satisfy the conjunction of three desiderata—objectivity (no dependence on human values), safety (preventing any catastrophic outcomes) and usefulness (the AI system is still able to do useful things). Impact measures are very clearly aiming for the first two criteria, but usually don’t have much to say about the third one. My expectation is that there is a strong tradeoff between the first two criteria and the third one, and impact measures have not dealt with this fact yet, but will have to at some point.
Update: The setup described in the OP involves a system that models humans. See this MIRI article for a discussion on some important concerns about such systems.
In none of these world-models, under no actions that it considers, does “episode 117 happen twice.”
Yes, episode 117 happens only once in the world model; and suppose the agent cares only about episode 117 in the “current execution”. The concern still holds: the agent might write a malign output that would result in additional invocations of itself in which episode 117 ends with the agent getting a high reward. Note that the agent does not care about the other executions of itself. The only purpose of the malign output is to increase the probability that the “current execution” is one that ends with the agent receiving a high reward.
It would be helpful if people could outline some plausible-seeming scenarios for how divergence between approval and actual preferences could cause a catastrophe, in order to get a better sense for the appropriate noise model.
One scenario that comes to mind: an agent generates a manipulative output that is optimized to be approved by the programmers while causing the agent to seize control over more resources (in a way that is against the actual preferences of the programmers).
Sorry, I didn’t understand the question (and what you meant by “The loss function is undefined after training.”).
After thinking about this more, I now think that my original description of this failure mode might be confusing: maybe it is more accurate to describe it as an inner optimizer problem. The guiding logic here is that if there are no inner optimizers then the question answering system, which was trained by supervised learning, “attempts” (during inference) to minimize the expected loss function value as defined by the original distribution from which the training examples were sampled; and any other goal system is the result of inner optimizers.
(I need to think more about this)
We might be interpreting “modest logic-related stuff” differently—I am thinking about simple formal problems like sorting a short list of integers.
I wouldn’t be surprised if GPT-2 (or its smaller version) is very capable at completing strings like “[1,2,” in a way that is merely syntactically correct. Publicly available texts on the internet probably contain a lot of comma-separated number lists in brackets. The challenge is for the model to have the ability to sort numbers (when trained only to predict the next word in internet texts).
However, after thinking about it more I am now less confident that GPT-2 would fail to complete my above sentence with a correctly sorted list, because for any two small integers like 2 and 3 it is plausible that the training data contains more “2,3” strings than “3,2” strings.
Consider instead the following problem:
“The median number in the list [9,2,1,6,8] is ”
I’m pretty sure that GPT-2 would fail at least 1/5 of the time to complete such a sentence (i.e. if we query it multiple times and each time the sentence contains small random integers).
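For anyone who wants to check this empirically, here is a rough sketch of such a repeated query using the publicly available gpt2 checkpoint via Hugging Face’s transformers library (the prompt format, sampling settings, and trial count are my own arbitrary choices):

```python
# Sketch: repeatedly query GPT-2 with random "median" prompts and count
# how often the completion starts with the correct answer.
import random
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

trials, correct = 20, 0
for _ in range(trials):
    nums = random.sample(range(1, 10), 5)  # five distinct small integers
    prompt = f"The median number in the list [{','.join(map(str, nums))}] is "
    answer = str(sorted(nums)[2])          # the median of five numbers
    out = generator(prompt, max_new_tokens=2, do_sample=True)[0]["generated_text"]
    correct += out[len(prompt):].strip().startswith(answer)

print(f"correct completions: {correct}/{trials}")
```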
In the case of GPT-2 the “current inference” is the current attempt to predict the next word given some text (it can be either during training or during evaluation).
In the malign-output scenario above the system indeed does not “care” about the future; it cares only about the current inference.
Indeed, the system “has no preference for being invoked”. But if it has been invoked and is currently executing, it “wants” to be in a “good invocation”—one in which it ends up with a perfect loss function value.
The training process optimizes only for immediate prediction accuracy.
Not exactly. The best way to minimize the L2 norm of the loss function over the training data is to simply copy the training data to the weights (if there are enough weights) and use some trivial look-up procedure during inference. To get models that are also useful for inputs that are not from the training data, you probably need to use some form of regularization (or use a model that implicitly carries it out), e.g. add to the objective function being minimized the L2 norm of the weights. Regularization is a way to implement Occam’s razor in machine learning.
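As a toy illustration of this tradeoff (a sketch; ridge regression is just the simplest standard example of adding the L2 norm of the weights to the objective):

```python
# Sketch: least squares with an L2 penalty on the weights (ridge regression).
# With more parameters than examples and a tiny penalty, the model can fit
# the training data almost exactly (memorization); a larger penalty trades
# training error for smaller, "simpler" weights.
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 (closed-form solution)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))   # 50 parameters, only 10 training examples
y = rng.normal(size=10)

for lam in (1e-8, 1.0):
    w = ridge_fit(X, y, lam)
    print(f"lam={lam}: training error={np.linalg.norm(X @ w - y):.4f}, "
          f"||w||={np.linalg.norm(w):.2f}")
```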
Suppose that, due to the regularization, the training results in a system with the goal system “minimize the expected value of the loss function at the end of the current inference” (where the concept of probability, which is required to define expectation, corresponds to how humans interpret the word “probability” in a decision-relevant context). For such a goal system, the malign-output scenario above seems possible (for a sufficiently capable system).
Have you looked at the NLP tasks they evaluated it on?
Yes. Nothing I’ve seen suggests GPT-2 would successfully solve simple formal problems like the one I mentioned in the grandparent (unless a very similar problem appears in the training data—e.g. the exact same problem but with different labels).
I’m pretty sure that GPT-2 would fail to complete even the sentence: “if we sort the list [3,1,2,2] we get [1,”. It’s a cool language model but can it do even modest logic-related stuff without similar examples in the training data?
There are no models of the world involved in the latter
The weights of the neural network might represent something that corresponds to an implicit model of the world.
no actions including manipulating a human or inventing exciting proteins.
Putting aside the risk of inner optimizers, suppose we get to a superintelligence level of capabilities, and it turns out that the training process produced a goal system such that the neural network yields some malign output that causes many future invocations of the neural network (indistinguishable from the current invocation) in which a perfect loss function value is achieved.