I just want to mention that this is an example of the credit assignment problem. Broadly punishing/rewarding every thought process whenever an outcome occurs is policy-gradient learning, which is going to be relatively slow, because (1) you get irrelevant punishments and rewards due to noise, so you’re “learning” when you shouldn’t be; and (2) you can’t zero in on the source of problems/successes, so you have to learn through the slow accumulation of a weak, noisy signal.
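To make (1) and (2) concrete, here’s a minimal toy sketch (the game, the numbers, and all names in it are invented for illustration): a REINFORCE-style update broadcasts a single win/loss signal to five decisions, even though only the first decision actually matters.

```python
import math
import random

# Toy setup: five independent yes/no decisions per game, but only
# decision 0 affects the outcome. The policy-gradient update still
# broadcasts the one win/loss signal to every decision.

N_STEPS = 5
logits = [0.0] * N_STEPS   # preference for action 1 at each step
LR = 0.1

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for _ in range(5000):
    probs = [sigmoid(l) for l in logits]
    actions = [1 if random.random() < p else 0 for p in probs]

    # Only step 0 matters; the rest of the outcome is luck.
    win_prob = 0.7 if actions[0] == 1 else 0.3
    reward = 1.0 if random.random() < win_prob else -1.0

    # Every step receives the same reward, relevant or not (point 1),
    # and the real signal at step 0 is diluted by noise, so learning
    # is slow (point 2).
    for i in range(N_STEPS):
        logits[i] += LR * reward * (actions[i] - probs[i])

print([round(l, 2) for l in logits])
# Typical result: logits[0] drifts clearly positive, while the
# other logits random-walk near zero.
```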
So, model-based learning is extremely important. In practice, if you lose a game of Magic (or any game with hidden information and/or randomness), I think you should rely almost entirely on model-based updates. Don’t downgrade a strategy just because you lost with it; check only whether you could have done something better given the information you had. Plan at the policy level.
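As a toy sketch of what I mean (the options and numbers are made up), the update after a loss looks only at whether your choice was best in expectation given your information; the outcome never enters:

```python
# Toy model-based update after a loss: judge the line of play by its
# expected value given what you knew, not by how the game ended.

options = {"attack": 0.55, "hold back": 0.40}  # your model's win probabilities
chosen = "attack"
outcome = "loss"                               # you lost anyway

best = max(options, key=options.get)
if chosen == best:
    print("No update: it was the best call given your information.")
else:
    print(f"Update: '{best}' was better in expectation.")
# Note that `outcome` never enters the comparison.
```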
OTOH, model-based learning is full of problems, too. If your models are wrong, you’ll identify the wrong sub-systems to reward/punish. I’ve also argued that if your model-based learning is applied to itself, i.e., applied to the problem of correcting the models themselves, then you get loopy self-reinforcing memes which take over the credit-assignment system and employ rent-seeking strategies.
I currently see two opposite ways out of this dilemma.
1. Always use model-free learning as a backstop for model-based learning. No matter how true a model seems, ditch it if you keep losing when you use it (a small sketch of this follows below).
2. Keep your epistemics uncontaminated by instrumental concerns. Only ever do model-based learning; but don’t let your instrumental credit-assignment system touch your beliefs. Keep your beliefs subservient entirely to predictive accuracy.
Both of these have some distasteful aspects for rationalists. Maybe there is a third way which puts instrumental and epistemic rationality in perfect harmony.
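For what it’s worth, option 1 is easy to sketch (the model name, sample size, and threshold below are all invented): keep a model-free tally of results under each model, and retire any model whose track record stays bad no matter how compelling it seems.

```python
from collections import defaultdict

# Sketch of option 1: a model-free win/loss tally acts as a backstop
# for model-based reasoning.

results = defaultdict(list)   # model name -> outcomes (1 = win, 0 = loss)

def record(model, won):
    results[model].append(1 if won else 0)

def should_ditch(model, min_games=30, floor=0.45):
    games = results[model]
    if len(games) < min_games:
        return False   # not enough evidence to override the model yet
    return sum(games) / len(games) < floor   # the backstop wins

# However plausible "tempo-is-everything" feels as a model, the tally decides:
for won in [False] * 20 + [True] * 10:
    record("tempo-is-everything", won)
print(should_ditch("tempo-is-everything"))  # True: 10/30 < 0.45
```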
PS: I really like this post for relating a simple (but important) result in mechanism design (/ theory of the firm) to a simple (but important) introspective rationality problem.
Thanks for this comment; it highlights that the post _is_ an attempt in the right direction (model-based learning rather than pure outcome learning), and also that it may be using the wrong model (effort level is an insufficient causal factor).
Ah yeah, I didn’t mean to be pointing that out, but that’s an excellent point: “effort” doesn’t necessarily have anything to do with it. You were using “effort” as a handle for whether or not the agent is really trying, which, under a perfect-rationality assumption (plus an assumption of sufficient knowledge of the situation), would entail employing the best strategy. But in real life, conflating effort with credit-worthiness could be a big mistake.