I think simplicity/generality priors effectively have zero effect on whether a model is pushed towards or away from human values, and are IMO kind of orthogonal to the alignment-relevant questions.
I’d be curious how you would describe the core problem of alignment.
I’d split it into inner alignment, which is how we manage to instill any goal/value at all, ideally one that is at least somewhat stable, and outer alignment, which is selecting a goal that is resistant to Goodharting.
Let’s focus on inner alignment. By instill you presumably mean train. What values get trained is ultimately a learning problem, which in many cases (as long as the distribution over trained parameters can be approximated by a Boltzmann distribution) comes down to a simplicity-accuracy tradeoff.
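To make that tradeoff concrete, here's a minimal sketch, assuming the distribution over trained parameters $\theta$ takes a Boltzmann form with a complexity-penalising prior; the loss $L$, complexity term $C$, and temperature $T$ are just illustrative notation I'm introducing here:

$$p(\theta \mid D) \;\propto\; \underbrace{\exp\!\big(-L(\theta, D)/T\big)}_{\text{accuracy}} \;\cdot\; \underbrace{\exp\!\big(-C(\theta)\big)}_{\text{simplicity prior}}$$

$$\log p(\theta \mid D) \;=\; -\tfrac{1}{T}\,L(\theta, D) \;-\; C(\theta) \;+\; \text{const}$$

Fitting the data (driving $L$ down) and staying simple (keeping $C$ small) pull on the same log-posterior, which is the simplicity-accuracy tradeoff I mean.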