So the most hopeful thing that could happen is for nature to throw up a big wall that says “nope, you need to thoroughly solve reward hacking before you can have generalist agents” which requires us to solve the problem first. The worst thing that could happen is for the problem to not appear in the kind of way where you have to robustly solve it until after the agent is superintelligent and you suddenly get a lethal sharp left turn. So even if everything I’ve just said about how to solve the problem is very wrong, that it’s encountered as an otherwise impassable barrier so early… [emphasis mine]
It sounds like you’re saying “in order to make functional agents at all, we need to solve the Goodhart/wireheading problem in the robust and general way, not just with kludgy patches.” I don’t get how this follows from the rest of the section, though.
Couldn’t AI developers implement a series of non-generalizing “baby gates”, each patching a specific early reward hack behavior, until an AI is sophisticated enough to thoroughly hide that it’s strategically pursuing a misaligned goal?
Is the claim that if we took that approach, the resulting agents would not be effective, because there are too many available reward hacks derailing the agent, and so we can’t realistically patch them all?
What baby gate protects you from Claude subtly misspecifying all your unit tests? If you have to carefully check them all, this negates the benefit of automating the work. This applies to most complex intellectual work, e.g. literature review. It’s kind of like saying “what if you just had a general baby gate, so that people never have to grow up?”, well, they don’t really make baby gates like that, or at least people you have to baby gate like that are not economically productive.
More generally, if you want an autonomous agent it must be self monitoring and self evaluating. Humans, or at least the kind of humans you want as employees, do not need to be carefully externally vetted for each thing they do to ensure they do it properly. Rewards coming from the environment, as they do in most formal RL models, is an academic convenience. An actually autonomous agent has to be able to ontologize reward over the computable environment in a general way that doesn’t require some other mind to come in and correct it all the time. If you don’t have that, you’re not getting meaningful autonomy.
What baby gate protects you from Claude subtly misspecifying all your unit tests
Huh. This is an interesting question, since it feels tractable. Performing the exercise of “sit down with a clock and think for 5 minutes”.
So here are some things which I think would help:
1. Never allow a single (wlog) Claude instance to modify both the tests and the unit under test within the same task. Tasks may look like
“Here’s an item of the spec which hasn’t been implemented yet, write a test and ensure it fails”
or “Here’s an observed bug, write a test which reproduces the bug and ensure the regression test fails”
or “Here’s a failing test, fix the unit under test and ensure that tests pass”
but never “Here’s an observed bug, figure out what happened, fix it, and write a test to ensure it can’t happen again”
2. Have Claude write some example-based tests: these are your classic assert 5 == add(3, 2) style tests. The important thing is that the left-hand side of the assert, and all args to the unit under test, are primitives or composed from primitives.
3. Have Claude write some property-based tests. These tests specify some property which should hold over all inputs to the unit under test. For example, you could have a @given(left=st.integers(min_value=1), right=st.integers(min_value=1)) decorator on def test_add_two_positives_gt_left(left, right): assert add(left, right) > left. (Items 2 and 3 are sketched just after this list.)
4. Do mutation testing: take the unit under test and the AI-generated test suite, then make thousands of small changes (“mutants”) of the kind that common coding errors produce (replacing if a > b: foo(a) with if a < b: foo(a) or if a > b: foo(b), etc.), and check how many of the “mutants” your test suite catches. (A toy version is sketched after this list.)
5. Have a more trusted, less capable agent sanity-check the tests from 2 and 3. For extra oomph:
Give it a list of the tests, ordered by Shapley value of mutants killed or such[1] (also sketched after this list).
See if it can improve on the tests, by the above. Improving an existing test suite is often easier than writing it from scratch.
Give it some examples of mutants not killed by any tests, or not killed by any of the top N tests. See if those mutants suggest more test cases.
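To make items 2 and 3 concrete, here’s a minimal sketch in pytest style, using the hypothesis library; add and the test names are stand-ins of my own invention, not the actual unit under test:

```python
# Minimal sketch of items 2 and 3 (pytest style). `add` stands in for the
# real unit under test; item 3 requires the `hypothesis` library.
from hypothesis import given, strategies as st


def add(left, right):
    return left + right


# Item 2: example-based tests. Expected values and arguments are all
# primitives, so the assertion either matches the spec or visibly doesn't.
def test_add_examples():
    assert 5 == add(3, 2)
    assert 0 == add(-1, 1)


# Item 3: property-based test. Hypothesis generates many (left, right)
# pairs, and the stated property must hold for every one of them.
@given(left=st.integers(min_value=1), right=st.integers(min_value=1))
def test_add_two_positives_gt_left(left, right):
    assert add(left, right) > left
```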
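Item 4, hand-rolled as a toy rather than run through a real mutation-testing tool (mutmut or cosmic-ray, in Python-land). The mutants and the tiny suite below are made up for illustration; a mutant is “killed” when at least one test fails against it:

```python
# Toy mutation testing (item 4). Real tools generate mutants automatically
# by tweaking operators and constants in the unit under test; here a few
# hypothetical mutants of `add` are written out by hand.

def add(left, right):
    return left + right

MUTANTS = {
    "swap_to_sub": lambda left, right: left - right,
    "off_by_one": lambda left, right: left + right + 1,
    "drop_right": lambda left, right: left,
}

def run_suite(add_impl):
    """The example-based tests, run against an arbitrary implementation."""
    assert 5 == add_impl(3, 2)
    assert 0 == add_impl(-1, 1)

def killed_mutants():
    killed = set()
    for name, mutant in MUTANTS.items():
        try:
            run_suite(mutant)      # a surviving mutant passes every test
        except AssertionError:
            killed.add(name)       # the suite caught this injected fault
    return killed

if __name__ == "__main__":
    run_suite(add)                 # sanity check: the real unit passes
    kills = killed_mutants()
    print(f"killed {len(kills)}/{len(MUTANTS)} mutants: {sorted(kills)}")
```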
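And a sketch of the “Shapley value of mutants killed” ordering from item 5, assuming you already have a kill matrix (which tests kill which mutants). The matrix below is invented, and exact enumeration over permutations only works for a handful of tests; a real implementation would sample permutations instead:

```python
# Sketch of ordering tests by Shapley value of mutants killed. Each test is
# a "player" and v(S) is the number of distinct mutants killed by the set
# of tests S; a test's Shapley value is its marginal contribution to v,
# averaged over all orderings of the suite. KILLS is a made-up kill matrix.
from itertools import permutations

KILLS = {
    "test_add_examples": {"swap_to_sub", "off_by_one"},
    "test_add_two_positives_gt_left": {"swap_to_sub", "drop_right"},
    "test_add_zero_identity": {"off_by_one"},
}

def value(tests):
    """v(S): number of distinct mutants the coalition of tests kills."""
    return len(set().union(*(KILLS[t] for t in tests))) if tests else 0

def shapley_values(kills):
    totals = {t: 0.0 for t in kills}
    orderings = list(permutations(kills))
    for order in orderings:
        so_far = []
        for test in order:
            totals[test] += value(so_far + [test]) - value(so_far)
            so_far.append(test)
    return {t: total / len(orderings) for t, total in totals.items()}

if __name__ == "__main__":
    ranked = sorted(shapley_values(KILLS).items(), key=lambda kv: -kv[1])
    for test, score in ranked:
        print(f"{score:.2f}  {test}")
```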
Of the things I list here, I do 1 and 2 as standard practice at $DAYJOB. I do those mostly because those practices help with catching mundane mistakes: I don’t particularly expect Claude to write intentionally bad tests. Still, I expect those practices would also help if I was having an actively-malicious Claude instance write tests for me[2]. If the Claude wants to ship code that’s bad in some specific way with unit tests deliberately engineered not to expose that badness, these mitigation strategies would make evil!Claude’s job much harder.
On examination, these proposals share a basic thread which looks something like “make writing subtly-misspecified tests harder faster than you make writing correct tests harder”. Claude is better than I am at writing tests, or at least will be at some point in the near future. But there are limits to Claude’s abilities—if you can get to the point where Claude can just barely do the task, it won’t be able to do the task and also secretly accomplish some other goal which makes the task harder[3].
More generally, if you want an autonomous agent it must be self monitoring and self evaluating
I don’t think this is true. The most effective human organizations have people in separate roles. That’s not just for comparative advantage reasons. You want dev and QA to be different people. Likewise your salespeople should probably not be the same people writing the queries that say whether the thresholds for bonuses have been met. Sometimes you even want one person to operate in multiple roles sequentially, e.g. in red-green-refactor[4] development flows.
An actually autonomous agent has to be able to ontologize reward over the computable environment in a general way that doesn’t require some other mind to come in and correct it all the time. If you don’t have that, you’re not getting meaningful autonomy.
I note that public companies usually don’t leave anyone operating without accountability to[5], and correction from, someone else associated with the company. I doubt that your actual position is “human employees are not actually autonomous agents”, though.[6] Is your position that, if you need to erect so many safeguards against reward hacking, you’ll approximately always be better off just doing the task yourself, and so you can’t get meaningful work out of agents that reward hack?
[1] As far as I can tell, no tooling exists for this at this time. I kind of want to build such tooling, and am actively restraining myself, because the automated-test-generation rabbit hole goes arbitrarily deep and I have already wasted multiple weeks of my professional life chasing those rabbits with lackluster results (reliability at that level is just not that important for the code I write at $DAYJOB).
[2] Well, they would help against the particular threat model of subtly bad unit tests. When I wrote the phrase “having an actively-malicious Claude instance write tests for me”, my immediate reaction to my own words was to note that that’s a fancy way of saying “I have given a malicious Claude instance the ability to write and execute code on my machine”.
[3] Unless Claude is sandbagging, but at that point you just have an elicitation problem, and that’s the sort of problem the ML community is already quite good at solving[7].
[4] Or the ever-popular “red-green-ship” development flow.
[5] As long as AI agents don’t meaningfully have accountability, I don’t think it’s even desirable for them to have “actual autonomy”. Sure, a Claude instance who tries to sabotage the unit tests to slip bad code under the radar might cause the class of all scaffolded Claude agents as a whole to lose some credibility, but that particular Claude instance is ephemeral and none of the future instances will even remember that it happened or learn that doing that is against their interests. This is now way, way outside the scope of the original comment, though.
[6] Specifically, if your definition of “actually autonomous” excludes humans, I don’t understand why we care whether an agent is “actually autonomous”. If that is your position, though, I’d be curious whether you have particular capabilities in mind which would be unlocked by an agent with that level of autonomy, but would not be unlocked by a group of human-level-autonomous agents.
[7] I don’t particularly expect the ML community to solve the elicitation problem elegantly, but it does seem like the sort of problem which can be distilled down to a numeric score. The ML community has an extremely strong track record of making numeric scores go up. I would bet money they could make this number go up too.