FYI, §14.4 of my post here is in a vaguely similar genre, although I don’t think there’s any direct overlap.
There’s a general problem that people will want AGIs to find clever out-of-the-box solutions to problems, and there’s no principled distinction between “finding a clever out-of-the-box solution to a problem” and “Goodharting the problem specification”. We call it “clever out-of-the-box solution” when we’re happy with how it turned out, and we call it “Goodharting” when we’re sad about how it turned out, but it’s not a structural difference. So systems that systematically block the second thing are inevitably gonna systematically block the first thing, and I claim that your proposal here is no exception. That’s an alignment tax, which might be fine (depending on the scenario) but should be kept in mind.
Yeah, I agree there are similarities. I think a benefit of my approach, one that I should have emphasized more, is that it’s reflectively stable (and theoretically simple, and therefore easy to analyze). In your description of an AI that wants to seek clarification, it isn’t clear that it won’t self-modify (but it’s hard to tell).
There’s a general problem that people will want AGIs to find clever out-of-the-box solutions to problems, and there’s no principled distinction between “finding a clever out-of-the-box solution to a problem” and “Goodharting the problem specification”.
But there is a principled distinction. The distinction is whether the plan exploits differences between the goal specification and our actual goal. This is a structural difference, and we can detect it using information about our actual goal.
So systems that systematically block the second thing are inevitably gonna systematically block the first thing, and I claim that your proposal here is no exception.
My proposal is usually an exception to this, because it takes advantage of the structural difference between the two cases. The trick is that the validation set only contains things that we actually want. If it were to contain extra constraints beyond what we actually want, then yeah that creates an alignment tax.
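To make that structural difference concrete, here’s a minimal sketch (my illustration, not anything from the post; `proxy_spec` and `validation_checks` are hypothetical names, and the validation checks are assumed to encode only things we actually want and to be held out from the planner):

```python
# Minimal sketch of the claimed structural distinction. The planner only ever
# optimizes against proxy_spec; validation_checks is a held-out set containing
# only constraints we actually want. All names here are hypothetical.

def satisfies(plan, checks):
    return all(check(plan) for check in checks)

def classify(plan, proxy_spec, validation_checks):
    """Label a plan that was produced to score well on the proxy specification."""
    if not satisfies(plan, proxy_spec):
        return "fails the task"
    if satisfies(plan, validation_checks):
        return "clever out-of-the-box solution"   # exploits nothing we care about
    return "Goodharting the specification"        # exploits the spec / actual-goal gap
```

On this picture, rejecting plans that fail the validation checks blocks Goodharting without blocking genuinely clever solutions, provided the checks really do contain nothing beyond what we actually want.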
systems that systematically block the second thing are inevitably gonna systematically block the first thing
I think their proposal is not meant to cause doing-what-the-designer-hopes in response to an incomplete specification, but to be a failsafe in case the specification is wrong in a way that goes unnoticed, i.e., where you expect that what you meant to specify wouldn’t have certain effects.
Hmm, I’ll be more explicit.
(1) If the human has a complete and correct specification, then there isn’t any problem to solve.
(2) If the human gets to see and understand the AI’s plans before the AI executes them, then there also isn’t any problem to solve.
(3) If the human adds a specification, not because the human directly wants that specification to hold, in and of itself, but rather because that specification reflects what the human is expecting a solution to look like, then the human is closing off the possibility of out-of-the-box solutions. The whole point of out-of-the-box solutions is that they’re unexpected-in-advance.
(4) If the human adds multiple specifications that are (as far as the human can tell) redundant with each other, then no harm done, that’s just good conservative design.
(5) …And if the human then splits the specifications into Group A, which are used by the AI for the design, and Group B, which trigger shutdown when violated, and where each item in Group B appears redundant with the stuff in Group A, then that’s even better, as long as a shutdown event causes some institutional response, like maybe firing whoever was in charge of making the Group A specification and going back to the drawing board. (A rough code sketch of this Group A / Group B setup follows the excerpt below.) Kinda like something I read in “Personal Observations on the Reliability of the Shuttle” (Richard Feynman, 1986):
The software is checked very carefully in a bottom-up fashion. First, each new line of code is checked, then sections of code or modules with special functions are verified. The scope is increased step by step until the new changes are incorporated into a complete system and checked. This complete output is considered the final product, newly released. But completely independently there is an independent verification group, that takes an adversary attitude to the software development group, and tests and verifies the software as if it were a customer of the delivered product. There is additional verification in using the new programs in simulators, etc. A discovery of an error during verification testing is considered very serious, and its origin studied very carefully to avoid such mistakes in the future. Such unexpected errors have been found only about six times in all the programming and program changing (for new or altered payloads) that has been done. The principle that is followed is that all the verification is not an aspect of program safety, it is merely a test of that safety, in a non-catastrophic verification. Flight safety is to be judged solely on how well the programs do in the verification tests. A failure here generates considerable concern.
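Here’s a rough sketch of what (5) might look like in code (my own framing, not the post’s; `design_with`, `group_a`, `group_b`, and `notify_review_board` are hypothetical names). The point is that Group B is never optimized against, and a violation halts everything and kicks the problem back to the humans:

```python
# Rough sketch of (5): Group A constraints drive the AI's design; Group B
# constraints are never optimized against and only trigger shutdown. A Group B
# violation means the Group A specification itself was bad, so the response is
# institutional (halt, review, redo the spec), not silent replanning.
# All names here are hypothetical.

def run_with_failsafe(design_with, group_a, group_b, notify_review_board):
    plan = design_with(group_a)                      # the AI only ever sees Group A
    violated = [check for check in group_b if not check(plan)]
    if violated:
        notify_review_board(violated)                # institutional response: halt,
        raise SystemExit(                            # review, redo the Group A spec
            "shutdown: Group B check violated; Group A specification needs revision")
    return plan                                      # only now is the plan cleared to run
```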
Re-reading the post, I think it’s mostly advocating for (5) (which is all good), but there’s also some suggestion of (3) (which would eat into the possibility of out-of-the-box solutions, although that might be a price worth paying).
Great comment, agreed. There was some suggestion of (3), and maybe there was too much. I think there are times when expectations about the plan are equivalent to literal desires about how the task should be done. For making coffee, I expect that it won’t create much noise. But also, I actually want the coffee-making to not be particularly noisy, and if the first plan for making coffee also creates a lot of noise as a side effect, that’s a situation where something in the goal specification has gone horribly wrong (and there should be some institutional response).
I didn’t notice suggestion of (3) but I skimmed over some parts.
(Separately, the line “The whole point of out-of-the-box solutions is that they’re unexpected-in-advance” is funny to me / reminded me of this HPMOR scene[1], in that you imply expecting, in advance, non-specific out-of-the-box solutions, about which you can then also have strong expectations that they won’t involve certain things (e.g. a program-typing task not involving tiling the world outside this room with copies of the program), but I don’t anticipate we actually disagree)