Taking a stab at answering my own question; an almost-certainly non-exhaustive list:
Would the results be applicable to deep-learning-based AGIs?[1] If I think not, how can I be confident they couldn’t be made applicable?
Do the corrigibility results provide (indirect) insights into other aspects of engineering (rather than SGD’ing) AGIs?
How much weight one gives to avoiding x-risks vs s-risks.[2]
Who actually needs to know of the results? Would sharing the results with the whole Internet lead to better outcomes than (e.g.) sharing the results with a smaller number of safety-conscious researchers? (What does the cost-benefit analysis look like? Did I even do one?)
How optimistic (or pessimistic) one is about the common-good commitment (or corruptibility) of the people who one thinks might end up wielding corrigible AGIs.
[1] Something like the True Name of corrigibility might at first glance seem applicable only to AIs whose internals we meaningfully understand or can control.
[2] If corrigibility were easily feasible, then at first glance, that would seem to reduce the probability of extinction (via unaligned AI), but increase the probability of astronomical suffering (under god-emperor Altman/Ratcliffe/Xi/Putin/...).
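To make the x-risk/s-risk tradeoff in [2] slightly more concrete, here's a toy expected-disvalue comparison (my own back-of-the-envelope framing, not anything from the corrigibility results themselves): if publishing reduces the probability of extinction by $\Delta p_x$ and increases the probability of astronomical suffering by $\Delta p_s$, and one assigns disvalues $w_x$ and $w_s$ to those two outcomes, then publishing looks net-positive only when

$$ w_x \, \Delta p_x > w_s \, \Delta p_s, $$

which is why the relative weight one puts on s-risks vs x-risks ends up doing so much of the work in the decision.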