I guess the question I’m trying to ask is: What do you think the role of simulation and computation is for this field?
Longer:
Okay, this might be a stupid thought, but could one consider MARL environments, for example https://github.com/metta-AI/metta (Softmax), to be a sort of generator function for these sorts of reward functions?
Something something: it is easier to program constraints on what the reward function should satisfy and have gradient descent discover it than it is to fully specify it from scratch.
I think a lot of theory work is needed here, but there might be something to be said for having a simulation component as well, where you do some sort of combinatorial search for good reward functions.
(Yes, the thought that it will solve itself if we just bring it into a cooperative or similar MARL scenario and then do IRL on that is naive, but I think it might be an interesting strategy if we think about it as a combinatorial search problem that needs to satisfy certain requirements.)
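To make the combinatorial-search framing concrete, here is a minimal toy sketch. Everything in it is hypothetical (the feature vectors, the constraint set, and the thresholds are made up for illustration): a small parametrized family of linear reward functions, a few hand-written hard constraints a candidate must satisfy, and naive random search standing in for whatever smarter search one would actually use.

```python
import random

# Toy sketch: search a parametrized space of reward functions for
# candidates satisfying hand-written constraints. All features,
# constraints, and thresholds here are hypothetical illustrations.

DIM = 4  # dimensionality of the reward-parameter space (toy-sized)

def make_reward(theta):
    """A reward function as a linear combination of state features."""
    def reward(features):
        return sum(t * f for t, f in zip(theta, features))
    return reward

def satisfies_constraints(reward):
    """Constraints the designer programs in, instead of specifying
    the full reward function from scratch."""
    cooperate = reward([1.0, 1.0, 0.0, 0.0])  # joint-success state
    defect    = reward([1.0, 0.0, 1.0, 0.0])  # exploit-partner state
    idle      = reward([0.0, 0.0, 0.0, 1.0])  # do-nothing state
    return cooperate > defect and cooperate > idle

def random_search(n_samples=10_000, seed=0):
    """Naive stand-in for a real combinatorial search procedure."""
    rng = random.Random(seed)
    hits = []
    for _ in range(n_samples):
        theta = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
        if satisfies_constraints(make_reward(theta)):
            hits.append(theta)
    return hits

candidates = random_search()
print(f"{len(candidates)} / 10000 samples satisfy the constraints")
```

In an actual MARL setup, the cheap constraint check would be replaced by rolling agents out in the environment and doing IRL on the resulting behavior, which is exactly where it gets expensive and theory-hungry.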
I see many problems, but here’s the most central one (this is the failure mode I described in “‘The Era of Experience’ has an unsolved technical alignment problem”): If we have a 100-dimensional parametrized space of possible reward functions for the primary RL system, and every single one of those possible reward functions leads to bad and dangerous AI behavior (as I argued in the previous subsection), then … how does this help? It’s a 100-dimensional snake pit! I don’t care if there’s a flexible and sophisticated system for dynamically choosing reward functions within that snake pit! It can be the most sophisticated system in the world! We’re still screwed, because every option is bad!
Basically, I think we need more theoretical progress to find a parametrized space of possible reward functions, where at least some of the reward functions in the space lead to good AGIs that we should want to have around.
I agree that the ideal reward function may have adjustable parameters whose ideal settings are very difficult to predict without trial-and-error. For example, humans vary in how strong their different innate drives are, and pretty much all of those “parameter settings” lead to people getting really messed up psychologically if they’re on one extreme or the opposite extreme. And I wouldn’t know where to start in guessing exactly, quantitatively, where the happy medium is, except via empirical data.
So it would be very good to think carefully about test or optimization protocols for that part. (And that’s itself a terrifyingly hard problem, because there will inevitably be distribution shifts between the test environment and the real world. E.g., an AI could feel compassionate towards other AIs but indifferent towards humans.) We need to think about that, and we need the theoretical progress.