Yes, this is part of the appeal of catastrophe detectors: we can make an entire interesting statement fully formal by asking how often a model causes a catastrophe (as defined by a neural net catastrophe detector) on a formal distribution (defined by a generative neural net with a Gaussian random seed). This is now a fully formal statement (sketched in code after the list below), but I’m skeptical it helps much. Among other issues:
1. It’s probably not enough to explain only this type of statement to actualize all of ARC’s plans.
2. As I will explain in my next post, I’m skeptical that formalizing through catastrophe detectors actually helps much.
3. If the AI agent whose behavior you want to explain sometimes uses Google search or interacts with humans (very realistic possibilities), you inherently can’t reduce its behavior to formal statements.
4. You need to start training your explanation during pre-training. ARC’s vague hope is that the explanation target is why the model gets low loss on the (empirical) training set. What formally defined statement could be the explanation target during pre-training?
5. Even if the input distribution, the agent, and the catastrophe detector are all fully formal, you still need to deal with the capacity allocation problem. The formal input distribution is created by training a generative AI on real-world data. If you naively try to create the highest-quality explanation for why the AI agent never causes a catastrophe on the formally defined input distribution, you will probably waste a lot of resources explaining why the generative AI creating the inputs behaves the way it does. That makes you uncompetitive with ordinary agent training, since the agent doesn’t need to understand all the deep causes underlying the input distribution.
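As a minimal sketch of the fully formal statement mentioned at the top (the names G, M, C and all hyperparameters here are made-up stand-ins for illustration, not ARC’s notation), the quantity in question is just an expectation over the Gaussian seed of the generator:

```python
import torch

def estimated_catastrophe_rate(G, M, C, n_samples=10_000, latent_dim=512):
    """Monte Carlo estimate of how often the agent M triggers the catastrophe
    detector C on inputs drawn from the formal distribution G(z), z ~ N(0, I).

    G: generative net defining the formal input distribution
    M: the agent whose behavior we want to explain
    C: neural net catastrophe detector, assumed to output a probability
    """
    with torch.no_grad():
        z = torch.randn(n_samples, latent_dim)   # Gaussian random seed
        inputs = G(z)                            # formal input distribution
        outputs = M(inputs)                      # agent behavior on those inputs
        flags = C(inputs, outputs) > 0.5         # detector flags a catastrophe
    return flags.float().mean().item()
```

The snippet only pins down the quantity being talked about; the point of an explanation-based approach is to say something about it without brute-force sampling, since the catastrophes of interest may be far too rare to ever show up in a sample like this.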
Gotcha. I agree with 1-4, but I’m not sure I agree with 5; at least, I don’t agree that 5 is separate from 4.
In particular:
If we can make an explanation for some AI while we’re training it, and this is actually only a small increase in cost, then we can apply the same procedure to the input distribution generator. This doesn’t make training uncompetitive with training just the agent, as it only adds a small factor to the cost of an AI (the generator) that we needed to train anyway. So we shouldn’t need to waste a bunch of resources on the formal input distribution (see the rough cost accounting below).
I agree this implies that you have to handle making explanations for AIs trained to predict the input distribution, which causes you to hit issues with 4 again.
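To spell out the cost accounting behind that point (here ε is just an assumed symbol for the relative overhead of producing explanations alongside training, not a figure from ARC), with C_gen and C_agent the training costs of the generator and the agent:

$$(1+\varepsilon)\,C_{\text{gen}} + (1+\varepsilon)\,C_{\text{agent}} = (1+\varepsilon)\left(C_{\text{gen}} + C_{\text{agent}}\right)$$

so explaining the generator as well as the agent adds a small multiplicative factor on compute we were already spending, rather than a separate large budget for understanding the input distribution from scratch.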
I agree 4 and 5 are not really separate. The main point is that using formal input distributions for explanations just passes the buck to explaining things about the generative AI that defines the formal input distribution; at some point, something needs to have been trained on real data, and we need to explain behavior there.