Filip Sondej comments on Current AIs Provide Nearly No Data Relevant to AGI Alignment

Filip Sondej 17 Jan 2024 19:51 UTC
7 points
0
Yeah, true. But it’s also easier to do early, when no one is that invested in the hidden-recurrence architectures, and so there’s less resistance, it doesn’t break anyone’s plans.

Maybe a strong experiment would be to compare mamba-3b and some SOTA 3b transformer, trained similarly, on several tasks where we can evaluate CoT faithfulness. (Although maybe at 3b capability level we won’t see clear differences yet.) The hard part would be finding the right tasks.
- Daniel Kokotajlo 17 Jan 2024 21:55 UTC
  7 points
  0
  Parent
  Agreed. I was working on this for six months and I’ve been trying to get more people to work on it.
  
  We don’t have a way of measuring CoT faithfulness as far as I know, in general—but you emphasize “tasks where we can evaluate...” that seems intriguing to me, you are saying it may be feasible today for some tasks at least. What tasks do you have in mind?
  - Filip Sondej 18 Jan 2024 12:25 UTC
    5 points
    0
    Parent
    Unfortunately I didn’t have any particular tasks in mind when I wrote it. I was vaguely thinking about settings as in:
    
    https://arxiv.org/pdf/2305.04388.pdf
    https://arxiv.org/pdf/2307.13702.pdf
    
    Now that I though about it, for this particular transformers vs mamba experiment, I’d go with something even simpler. I want a task that is very easy sequentially, but hard to answer immediately. So for example a task like:
    
    x = 5 x += 2 x *= 3 x **= 2 x -= 3 ...
    
    and then have a CoT:
    
    after x = 5 5 after x += 2 7 ...
    
    And then we intervene on CoT to introduce some error in one operation, but still ask of the model to give the correct answer at the end. (Despite all steps after the error being irrelevant.) We can go even further and train the models to give the correct answer after inadequate CoT. And have a curriculum where at first it only needs to do one hidden operation, later two, and so on.
    
    (It’s an unrealistic setting, but the point is rather to check if the model is able at all to learn hidden sequential reasoning.)
    
    Now, my hypothesis is that transformers will have some limited sequence length for which they can do it (probably smaller than their number of layers), but mamba won’t have a limit.
    
    I was working on this for six months
    
    Can you say what you tried in these six months and how did it go?