Don’t feel bad about the lack of comments—it’s the lurkers who are wrong :P I’m super excited to see attempts to “just solve the problem,” and I think this kind of approach has a lot going for it.
Here are various comments I accumulated while reading:
The really important scenario is “I’m an AI considering modifying myself to more effectively realize human values. What should I do?” To be hyperbolic, other data about human values is relevant only insofar as it generalizes to this scenario.
(Okay, maybe you don’t mean to be working on AGI and only mean to do better alignment of the narrow sort already being done to LLMs. I’m still personally interested in thinking ahead to the implications for AGI.)
To this end, we want to promote generalization and be cautious of high levels of context-dependence. We’ve messed up somewhere if the lessons the AI learns about what to say in the context of abortion have no bearing whatsoever on what it says in the context of self-improving AI. Some context dependence is fine, but the main point is to find the models of human values that generalize to important future questions.
It might be a good idea for a value-finding process to be built with this desideratum in mind. A few options:

- Users could get information about how (and to what extent) behavior in one scenario would generalize to other scenarios under the current model. They might give different feedback if they knew what the broader ramifications of that feedback would be.
- Users could give meta-level feedback, saying directly how they want generalization to be done. Actually changing generalization to reflect this feedback is… an open problem.
- The AI could do active inference, prioritizing the collection of information from humans that might influence how it generalizes to important future questions (a rough sketch below).
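To make that third option concrete, here is a minimal sketch of scoring candidate queries by a crude value-of-information estimate. Everything in it (the ensemble of candidate value models, the uniform likelihood reweighting, the names) is an assumption for illustration, not anyone's actual system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: an ensemble of K candidate value models, each giving
# P(answer = "approve") for any scenario.
K = 50                   # ensemble size
N_QUERIES = 20           # scenarios we could ask a human about
N_TARGETS = 5            # the "important future questions"

# p[k, q]: model k's predicted approval probability on candidate query q
p = rng.uniform(size=(K, N_QUERIES))
# t[k, s]: model k's prediction on target scenario s
t = rng.uniform(size=(K, N_TARGETS))

def expected_target_shift(q: int) -> float:
    """Expected change in the ensemble's target-scenario predictions after
    observing the human's answer to query q (a crude proxy for the value
    of information about generalization)."""
    prior = t.mean(axis=0)  # current ensemble prediction on targets
    shift = 0.0
    for approve, weight in ((True, p[:, q].mean()), (False, 1 - p[:, q].mean())):
        # Reweight ensemble members by the likelihood of the observed answer
        # (Bayes' rule with a uniform prior over members).
        lik = p[:, q] if approve else 1 - p[:, q]
        post_w = lik / lik.sum()
        posterior = post_w @ t  # reweighted target predictions
        shift += weight * np.abs(posterior - prior).sum()
    return shift

# Ask about the scenario whose answer most moves our beliefs about the
# questions we actually care about, not whatever minimizes loss in general.
best = max(range(N_QUERIES), key=expected_target_shift)
print(f"most informative query: {best}")
```

The same shape works with real posteriors instead of a reweighted ensemble; the point is only that "what to ask next" can be scored against the target scenarios rather than against predictive accuracy everywhere.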
More general values aren’t always better. Generalizing values often means throwing away information about who you are and what you’re like. Sometimes I’ll think that information is important, and I’ll want an AI to take those more-specific values into consideration!
So while I admire the spirit of just going for it, I don’t think that connecting expressed values with a broadness/goodness relation is very helpful for aggregation. And de-emphasizing this relation is the same as de-emphasizing the graph structure.
Still, the idea of building this graph was a good one to elaborate on, even if I don't think it works out. I'm inspired to go list 10 similar ideas and see whether any of them leads to progress.
Deleting cycles is almost as bad as avoiding all cases where people disagree on a preference ordering. Any democratic process has to be able to handle disagreement! (On second thought, maybe if you’re not aiming for AGI it’s fine to try to just find places where 99% of people agree. I’m still personally interested in handling disagreement.) Rather than deferring to “negotiations,” we might ambitiously try to design AI systems that would be good outcomes of such negotiations if they took place.
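To see why the cycles are load-bearing rather than noise, here is the textbook Condorcet paradox in a few lines (voters and options invented for illustration): each individual ranking is perfectly coherent, yet the pairwise majority relation cycles, so any cycle-deleting step discards exactly the disagreement a democratic process is supposed to handle.

```python
# Three perfectly consistent voters, one unavoidable cycle (Condorcet's paradox).
rankings = [
    ["A", "B", "C"],  # voter 1: A > B > C
    ["B", "C", "A"],  # voter 2: B > C > A
    ["C", "A", "B"],  # voter 3: C > A > B
]

def majority_prefers(x: str, y: str) -> bool:
    """True if a strict majority of voters ranks x above y."""
    votes = sum(r.index(x) < r.index(y) for r in rankings)
    return votes > len(rankings) / 2

edges = []
for x, y in [("A", "B"), ("A", "C"), ("B", "C")]:
    if majority_prefers(x, y):
        edges.append((x, y))
    elif majority_prefers(y, x):
        edges.append((y, x))

print(edges)  # [('A', 'B'), ('C', 'A'), ('B', 'C')] — majority preference cycles
```

Resolving a cycle like this takes some actual aggregation rule (Ranked Pairs, Schulze, bargaining, …) rather than pretending it isn't there.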