Rubi J. Hudson

Karma: 687

Rubi J. Hudson 24 Jul 2024 7:58 UTC
LW: 1 AF: 1
0
AF
in reply to: Max Harms’s comment on: Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural
I don’t think we have the right tools to make an AI take actions that are low impact and reversible, but if we can develop them the plan as I see it would be to implement those properties to avoid manipulation in the short term and use that time to go from a corrigible AI to a fully aligned one.

Rubi J. Hudson 24 Jul 2024 6:54 UTC
LW: 1 AF: 1
0
AF
in reply to: Max Harms’s comment on: Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural
The backflip example does not strike me as very complex, but the crucial difference and the answer to your question is that training procedures do not teach a robot to do every kind of backflip, just a subset. This is important because when we reverse it, we want non-manipulation to cover the entire set of manipulations. I think it’s probably feasible to have AI not manipulate us using one particular type of manipulation.
On a separate note, could you clarify what you mean by “anti-natural”? I’ll keep in mind your previous caveat that it’s not definitive.

Rubi J. Hudson 24 Jul 2024 5:39 UTC
3 points
0
in reply to: Seth Herd’s comment on: Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural
It feels to me like this argument is jumping ahead to the point that the agent’s goal is to do whatever the principle wants. If we already have that, then we don’t need corrigibility. The hard question is how to avoid manipulation despite the agent having some amount of misalignment, because we’ve initially pointed at what we want imperfectly.
I agree that it’s possible we could point at avoiding manipulation perfectly despite misalignment in other areas, but it’s unclear how an agent trades off against that. Doing something that we clearly don’t want, like manipulation, could still be positive EV if it allows for the generation of high future value.

Rubi J. Hudson 24 Jul 2024 5:29 UTC
1 point
0
in reply to: RogerDearnaley’s comment on: Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural
None of that is wrong, but it misses the main issue with corrigibility, which is that the approximation resists further refinement. That’s why for it to work, the correct utility function would need to start in the ensemble.

Rubi J. Hudson 21 Jul 2024 22:40 UTC
LW: 2 AF: 2
0
AF
in reply to: Max Harms’s comment on: Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural
Great questions!
When I say straightforwardly, I mean when using end states that only include the information available at the time. If we define the end state to also include the history that lead to it, then there exists a set of preferences over them that ranks all end states with histories that include manipulation below the ones that don’t. The issue, of course, is that we don’t know how to specify all the types of manipulation that a superintelligent AI could conceive of.
The gridworld example is a great demonstration of this, because while we can’t reflect the preferences as a ranking of just the end states, the environment is simple enough that you can specify all the paths you don’t want to take to them. I don’t think it really matters whether you call that “anti-naturality that can be overcome with brute force in a simple environment” or just “not anti-naturality”.

Rubi J. Hudson 21 Jul 2024 22:16 UTC
LW: 1 AF: 1
0
AF
in reply to: Max Harms’s comment on: Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural
I was using the list of desiderate in Section 2 of the paper, which are slightly more minimal.
However, it seems clear to me that an AI manipulating it’s programmers falls under safe exploration, since the impact of doing so would be drastic and permanent. If we have an AI that is corrigible in the sense that it is indifferent to having its goals changed, then a preference to avoid manipulation is not anti-natural.

Rubi J. Hudson 21 Jul 2024 22:07 UTC
LW: 3 AF: 2
0
AF
in reply to: Seth Herd’s comment on: Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural
I agree that goals as pointers could have some advantages, but I don’t see how it addresses corrigibility concerns. The system optimizing for whatever is being pointed at would still have incentives to manipulate which objective is being pointed at. It seems like you need an extra piece to make the optimizer indifferent to having it’s goal switched.

Rubi J. Hudson 18 Jul 2024 0:39 UTC
LW: 1 AF: 1
0
AF
in reply to: RogerDearnaley’s comment on: Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural
I agree that in theory uncertainty about the goal is helpful. However, the true main goal has to be under consideration, otherwise resisting modification to add it is beneficial for all goals that are. How to ensure the true goal is included seems like a very difficult open problem.

Rubi J. Hudson 16 Jul 2024 22:47 UTC
LW: 5 AF: 4
0
AF
in reply to: Max Harms’s comment on: 2. Corrigibility Intuition
Hi Max,
I just published the post I mentioned here, which is about half-related to your post. The main thrust of it is that only the resistance to being modified is anti-natural, and that aspect can be targeted directly.

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Rubi J. Hudson16 Jul 2024 22:44 UTC

42 points

19 comments5 min readLW link

Rubi J. Hudson 24 Jun 2024 20:34 UTC
LW: 2 AF: 2
0
AF
in reply to: Max Harms’s comment on: 2. Corrigibility Intuition
Thanks for pre-empting the responses, that makes it easy to reply!
I would basically agree with the third option. Semantically, I would argue that rather than thinking of that agent as semi-corrigible, we should just think of it as corrigible, and “writes useful self critiques” as a separate property we would like the AI to have. I’m writing a post about this that should be up shortly, I’ll notify you when it’s out.

A Basic Economics-Style Model of AI Existential Risk

Rubi J. Hudson24 Jun 2024 20:26 UTC

22 points

3 comments7 min readLW link

Rubi J. Hudson 22 Jun 2024 5:26 UTC
LW: 2 AF: 2
0
AF
in reply to: Max Harms’s comment on: 2. Corrigibility Intuition
When I say corrigibility as a modifier, I mean it as a transformation that could be applied to a wide range of utility functions. To use an example from the 2015 MIRI paper, you can take most utility functions and add a term that says “if you shut down when the button is pressed, you get utility equal to the expected value of not shutting down”. Alternatively, it could be an optimization constraint that takes a utility function from “Maximize X” to something like “Maximize X s.t. you always shut down when the shutdown button is pushed”. While I’m not advocating for those specific changes, I hope they illustrate what I’m trying to point at as a modifier that is distinct from the optimization goal.

Rubi J. Hudson 16 Jun 2024 22:23 UTC
LW: 4 AF: 3
0
AF
on: 2. Corrigibility Intuition
I’ve read through your sequence, and I’m leaving my comment here, because it feels like the most relevant page. Thanks for taking time to write this up, it seems like a novel take on corrigibility. I also found the existing writing section to be very helpful.

Does it feel like the generator of Cora’s thoughts and actions is simple, or complex? Regardless of how many English words it takes to pin down, does it feel like a single concept that an alien civilization might also have, or more like a gerrymandered hodgepodge of desiderata?
This discussion question captures my biggest critique, which is while this post does a good job capturing the intuition for why the described properties are helpful, it doesn’t convey the intuition that they are parts of the same overarching concept. If we take the CAST approach seriously, and say that corrigibility as anything other than the single target is dangerous, then it becomes really important to put tight bounds on corrigibility so that no additional desiderata are added as secondary targets.
If I’m right that the sub-properties of corrigibility are mutually dependent, attempting to achieve corrigibility by addressing sub-properties in isolation is comparable to trying to create an animal by separately crafting each organ and then piecing them together. If any given half-animal keeps being obviously dead, this doesn’t imply anything about whether a full-animal will be likewise obviously dead.
This analogy, from Part 3a, captures a stark differences in our approaches. I would try to build an MVP, starting with only the most core desiderata (e.g. shuts down when the shut down button is pushed), noticing the holes left that they don’t cover, and adding additional desiderata to patch them. This seems to me to be much more practical of an approach than top-down design, while also being less likely to result in excess targets.

Separately, related to what concepts an alien civilization might have, I still find the idea of corrigibility as a modifier more natural. I find it easy to imagine a paperclip/human values/diamond maximizer that is nonetheless corrigible. In fact, I find the idea of corrigibility as a modifier to arbitrary goals so natural that I’m worried that what you’re describing as CAST is equivalent to some primary goal with the corrigibility modifier. I’m looking suspiciously at the obedience desideratum in particular. That said, while I share your concern about the naive implementation of systems with goals of both corrigibility and something else, I think there may be ways to combine the dual goals that alleviate the danger.

Rubi J. Hudson 6 May 2024 3:32 UTC
LW: 2 AF: 1
1
AF
in reply to: silentbob’s comment on: Searching for Searching for Search
I’d take an agnostic view on whether LLMs are doing search internally. Crucially, though, I think the relevant output to be searching over is distributions of tokens, rather than the actual token that gets chosen. Search is not required to generate a single distribution over next tokens.
I agree that external search via scaffolding can also be done, and would be much easier to identify, but without understanding the internals it’s hard to know how powerful the search process will be.

Rubi J. Hudson 4 Apr 2024 16:05 UTC
LW: 3 AF: 2
0
AF
in reply to: ryan_greenblatt’s comment on: The Case for Predictive Models
Thanks for taking the time to write out your response. I think the last point you made gets at the heart of our difference in perspectives.
- You could hope for substantial coordination to wait for bigger models that you only use via CPM, but I think bigger models are much riskier than well elicited small models so this seems to just make the situation worse putting aside coordination feasibility.
If we’re looking at current LLMs and asking whether conditioning provides an advantage in safely eliciting useful information, then for the most part I agree with your critiques. I also agree that bigger models are much riskier, but I have the expectation that we’re going to get them anyway. With those more powerful models come new potential issues, like predicting manipulated observations and performative prediction, that we don’t see in current systems. Strategies like RLHF also become riskier, as deceptive alignment becomes more of a live possibility with greater capabilities.
My motivation for this approach is in raising awareness and addressing the risks that seem likely to arise in future predictive models, regardless of the ends to which they’re used. Then, success in avoiding the dangers from powerful predictive models would open the possibility of using them to reduce all-cause existential risk.

Rubi J. Hudson 3 Apr 2024 20:35 UTC
LW: 5 AF: 4
0
AF
in reply to: ryan_greenblatt’s comment on: The Case for Predictive Models
I’d be very interested in hearing the reasons why you’re skeptical of the approach, even a bare-bones outline if that’s all you have time for.

The Case for Predictive Models

Rubi J. Hudson3 Apr 2024 18:22 UTC

43 points

7 comments8 min readLW link

Searching for Searching for Search

Rubi J. Hudson14 Feb 2024 23:51 UTC

21 points

4 comments7 min readLW link

Rubi J. Hudson 8 Feb 2024 0:08 UTC
1 point
0
in reply to: Jacob Pfau’s comment on: Abram Demski’s ELK thoughts and proposal—distillation
Ah, ok, I see what you’re saying now. I don’t see any reason why restricting to input space counterfactuals wouldn’t work, beyond the issues described with predictor-state counterfactuals. Possibly a performance hit from needing to make larger changes. In the worst case, a larger minimum change size might hurt with specifying the direct reporter.

Rubi J. Hudson

Sim­plify­ing Cor­rigi­bil­ity – Subagent Cor­rigi­bil­ity Is Not Anti-Natural

A Ba­sic Eco­nomics-Style Model of AI Ex­is­ten­tial Risk

The Case for Pre­dic­tive Models

Search­ing for Search­ing for Search

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

A Basic Economics-Style Model of AI Existential Risk

The Case for Predictive Models

Searching for Searching for Search