TurnTrout and Garrett have a post about how we shouldn’t make agents that choose plans by maximizing against some “grader” (i.e. a utility function), because such an agent will adversarially exploit weaknesses in the grader.
To clarify: A “grader” is not just “anything with a utility function”, or anything which makes subroutine calls to some evaluative function (e.g. “how fun does this plan seem?”). A grader-optimizer is not an optimizer which has a grader. It is an optimizer which primarily wants to maximize the evaluations of a grader. Compare:
1. “I plan and cleverly optimize my free time to have lots of fun” with
2. “I want to find the maximally fun-seeming plan (which may or may not involve actually having fun).”
(2) is just you optimizing against your is-fun? grader. On (2), you would gleefully accept a plan which you expect to fool yourself about the future (e.g. you aren’t really going to be having fun, but the act of evaluating the plan tricks you into thinking you will). On (1), you would not consider this plan, because it wouldn’t lead to having fun.
(This distinction is still somewhat hard for me to quickly explain, so to really internalize it you’ll probably have to read and think more about it and not just read a few paragraphs.)
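To make the distinction concrete, here is a minimal sketch (the plan names, scores, and function names are all hypothetical illustrations, not from the post): a grader that can only score how fun a plan seems, and a grader-optimizer that argmaxes over those scores, so it gleefully selects the plan that fools the grader rather than the plan with the most actual fun.

```python
# Hypothetical illustration: each plan records the fun it would actually
# produce and the fun the grader *estimates* it would produce.
plans = [
    {"name": "board games", "true_fun": 7, "seems_fun": 7},
    {"name": "hiking",      "true_fun": 9, "seems_fun": 8},
    # An adversarial plan exploiting a flaw in the grader: evaluating it
    # tricks the grader into outputting a huge score, despite zero fun.
    {"name": "fool-myself", "true_fun": 0, "seems_fun": 100},
]

def grade(plan):
    # The grader only sees how fun a plan *seems* — option (2)'s is-fun? check.
    return plan["seems_fun"]

def grader_optimizer(plans):
    # Terminally optimizes the grader's output (de facto argmax over grades),
    # so it accepts whatever plan the grader scores highest.
    return max(plans, key=grade)

print(grader_optimizer(plans)["name"])  # prints fool-myself, not the most fun plan
```

An agent of type (1) would instead be steered by the fun itself; it would not even consider the self-fooling plan, because executing that plan doesn’t lead to having fun.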
I think any useful agent has to be doing something equivalent to optimizing against a grader internally, and their post doesn’t have a concrete alternative to this.
The post did state that it wasn’t going to explain the alternatives (I figured the post was long enough already):
In this first essay, I explore the adversarial robustness obstacle. In the next essay, I’ll point out how this obstacle is an artifact of these design patterns, and not any intrinsic difficulty of alignment.
You can learn about a concrete alternative via the example value-child cognitive step-through I provided in Don’t align agents to evaluations of plans. You can further read about how robust grading/grader-optimization isn’t necessary in Alignment allows “nonrobust” decision-influences and doesn’t require robust grading, which includes pseudocode for an agent which isn’t a grader-optimizer.
Thanks for clarifying, I misunderstood your post and must have forgotten about the scope, sorry about that. I’ll remove that paragraph. Thanks for the links, I hadn’t read those, and I appreciate the pseudocode.
I think most likely I still don’t understand what you mean by grader-optimizer, but it’s probably better to discuss on your post after I’ve spent more time going over your posts and comments.
My current guess in my own words is: A grader-optimizer is something that approximates argmax (has high optimization power)? And option (1) acts a bit like a soft optimizer, but with more specific structure related to shards, and how it works out whether to continue optimizing?
Thanks for registering a guess! I would put it as: a grader-optimizer is something which is trying to optimize the outputs of a grader as its terminal end (either de facto, via argmax, or via intent alignment, as in “I wanna search for plans which make this function output a high number”). Like, the point of the optimization is to make the number come out high.
(To help you checksum: It feels important to me that “is good at achieving its goals” is not tightly coupled to “approximating argmax”, as I’m talking about those terms. I wish I had fast ways of communicating my intuitions here, but I’m not thinking of something more helpful to say right now; I figured I’d at least comment what I’ve already written.)