Immediate thoughts: I would want to
(1) examine Jaynes’s analogy in the light of Cosma Shalizi’s critique of Physics from Fisher Information
(2) compare your moral gauge theory to Eric Weinstein’s geometric marginalism (and again, take note of a critique, here due to Timothy Nguyen).
Thanks for the links! I was unaware of these and both are interesting.
I was probably a little heavy-handed in my wording in the post; I agree with Shalizi’s point that we should be careful not to over-interpret the analogy between physics and Bayesian analysis. However, my goal isn’t to “derive physics from Bayesian analysis”; it’s more a source of inspiration. Physics tells us that continuous symmetries lead to robust conservation laws. Because the mathematics is so similar, if we could force reward functions to exhibit the same kind of invariance (Noether currents, conservation laws, etc.), then, by analogy with the robust generalisation we see in physics, it might help AI generalise reliably even out of distribution.
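To make that concrete, here is a minimal toy sketch (entirely made up, nothing from the post) of what “the reward exhibits the invariance” would mean numerically: a hypothetical reward defined on two abstract “moral coordinates” that is unchanged when those coordinates are rotated, i.e. invariant under a one-parameter continuous transformation group.

```python
import numpy as np

def toy_reward(x):
    # Hypothetical reward on two abstract "moral coordinates": it depends
    # only on the distance from the origin, so it is unchanged by rotations
    # of the coordinate plane (a stand-in for a continuous symmetry).
    return -np.linalg.norm(x)

def rotate(x, theta):
    # The one-parameter family of transformations (the candidate symmetry).
    c, s = np.cos(theta), np.sin(theta)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

def invariance_gap(reward, transform, xs, thetas):
    # Largest change in reward when sample points are moved along the
    # symmetry; zero (up to floating point) means the reward is invariant.
    return max(abs(reward(transform(x, t)) - reward(x))
               for x in xs for t in thetas)

rng = np.random.default_rng(0)
points = [rng.normal(size=2) for _ in range(100)]
angles = np.linspace(0.0, 2 * np.pi, 16)
print(invariance_gap(toy_reward, rotate, points, angles))  # ~1e-15: invariant
```

Swapping in a reward that depends on the coordinates themselves rather than just the radius makes the gap jump to order one, which is roughly how I’d expect most current reward models to behave under a check like this.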
If I understand Nguyen’s critique correctly, it has essentially two parts:
a) The gauge-theory reformulation is mathematically valid but trivial, and can be restated without gauge theory. Furthermore, he claims that because Weinstein uses such a complex mathematical formalism to do something trivial, he risks obscurantism.
My response: I think it’s unlikely that current reward functions are trivially invariant under gauge transformations of the moral coordinates. There are many examples of respectable moral frameworks that genuinely disagree on certain statements. Current approaches seek to “average over” these disagreements rather than translate between them in an invariant way.
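As a deliberately crude illustration of that difference (toy numbers, and plain standardisation standing in for a genuine “translation” between frameworks, not a gauge transformation in any real sense): averaging raw scores from two frameworks expressed in different “coordinates” just lets whichever framework has the larger scale dominate, whereas mapping both into a shared frame first lets each framework’s view actually register.

```python
import numpy as np

# Two hypothetical moral frameworks scoring the same four outcomes, each in
# its own "coordinates" (different scale and offset), and disagreeing about
# which outcome is best.
framework_a = np.array([0.1, 0.8, 0.6, 0.3])    # scores in [0, 1], prefers outcome 1
framework_b = np.array([-8.0, -2.0, 2.0, 8.0])  # scores in [-10, 10], prefers outcome 3

# Naive averaging mixes incompatible coordinates: framework_b's larger scale
# dominates, so the "compromise" is really just framework_b's ranking.
naive = (framework_a + framework_b) / 2

# A change of coordinates (here, plain standardisation) is a very crude
# stand-in for translating both frameworks into a shared frame before
# comparing them.
def standardise(scores):
    return (scores - scores.mean()) / scores.std()

translated = (standardise(framework_a) + standardise(framework_b)) / 2

print(np.argsort(naive))       # ranking inherited almost entirely from framework_b
print(np.argsort(translated))  # a compromise in which both frameworks register
```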
b) The theory depends on the choice of a connection (in my formulation, A_μ), which is not canonical. In other words, it’s not clear which choice would capture “true” moral behaviour.
My response: I agree that this is challenging (which is part of the reason I didn’t try to do it in the post). However, I think the difficulty is valuable. If our reward function cannot be endowed with these canonical invariances, it won’t generalise robustly out of distribution. In that sense, using these ideas as a kind of diagnostic tool to assess whether the reward function possesses some invariance could give us a clue about whether the reward will generalise robustly.
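A rough sketch of what such a diagnostic might look like (again a toy: the candidate symmetry, the rewards, and the notion of “invariant enough” are all made up): compare how much the reward drifts when points are nudged along the candidate symmetry direction against how much it drifts along random directions of the same size.

```python
import numpy as np

def rotate(x, theta):
    # Candidate symmetry: rotation of the two abstract "moral coordinates".
    c, s = np.cos(theta), np.sin(theta)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

def symmetry_violation(reward, transform, xs, eps=1e-2, seed=0):
    # Diagnostic: mean reward drift along the candidate symmetry direction,
    # divided by mean drift along random directions of the same length.
    # Near 0 -> the reward is (locally) insensitive to the transformation;
    # near 1 -> the proposed invariance simply is not there.
    rng = np.random.default_rng(seed)
    along, generic = [], []
    for x in xs:
        step = transform(x, eps) - x
        noise = rng.normal(size=x.shape)
        noise *= np.linalg.norm(step) / np.linalg.norm(noise)
        along.append(abs(reward(x + step) - reward(x)))
        generic.append(abs(reward(x + noise) - reward(x)))
    return np.mean(along) / np.mean(generic)

rng = np.random.default_rng(1)
points = [rng.normal(size=2) for _ in range(50)]

invariant_reward = lambda x: -np.linalg.norm(x)     # respects the symmetry
non_invariant_reward = lambda x: x[0] - 0.5 * x[1]  # does not

print(symmetry_violation(invariant_reward, rotate, points))      # ~0
print(symmetry_violation(non_invariant_reward, rotate, points))  # well away from 0
```

On the argument above, a score well away from zero for every plausible candidate symmetry would be a warning sign that the reward has no invariance to lean on out of distribution.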