I expect you to respond that being deontological won’t get you to optimality. But I would say that talking about “optimality” here ruins the abstraction, for reasons outlined in my previous comment
I was actually going to respond, “that’s a good point, but (IMO) a different concern than the one you initially raised”. I see you making two main critiques.
(paraphrased) “A won’t produce optimal policies for the specified reward function [even assuming alignment generalization off of the training distribution], so your model isn’t useful” – I replied to this critique above.
“The space of goals that agents might learn is very different from the space of reward functions.” I agree this is an important part of the story. I think the reasonable takeaway is “current theorems on instrumental convergence help us understand what superintelligent A won’t do, assuming no reward-result gap. Since we can’t assume alignment generalization, we should keep in mind how the inductive biases of gradient descent affect the eventual policy produced.”
I remain highly skeptical of the claim that applying this idealized theory of instrumental convergence worsens our ability to actually reason about it.
ETA: I read some information you privately messaged me, and I see why you might see the above two points as a single concern.