Is there some theoretical result along the lines of “A sufficiently large transformer can learn any HMM”?
Depending on what one means by ‘learn’, this is provably impossible. The reason has nothing to do with the transformer architecture (which one shouldn’t think of as a canonical architecture in the grand scheme of things anyway).
There is a 2-state generative HMM such that the optimal predictor of its output provably requires an infinite number of states. This holds for any model of computation and any architecture.
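Roughly, the mechanism is that the optimal next-symbol predictor has to track the posterior over hidden states given the observed history (the "belief" state), and for a nonunifilar HMM the set of reachable posteriors need not be finite. Here is a minimal sketch of that mechanism with made-up parameters for a generic nonunifilar 2-state HMM (an illustration, not necessarily the specific construction referred to above):

```python
# Sketch: count the distinct belief states (posteriors over hidden states)
# reachable from a uniform prior as the observed history grows. The
# transition/emission numbers are made up for illustration.
import itertools
import numpy as np

T = np.array([[0.7, 0.3],    # T[s, s'] = P(next hidden state s' | state s)
              [0.2, 0.8]])
E = np.array([[0.9, 0.1],    # E[s, o] = P(observed symbol o | state s)
              [0.4, 0.6]])

def update_belief(b, obs):
    """One forward-recursion step: propagate through T, condition on obs, normalize."""
    u = (b @ T) * E[:, obs]
    return u / u.sum()

def count_distinct_beliefs(max_len, decimals=10):
    """Number of distinct belief states reachable over all observation
    sequences of length <= max_len, starting from a uniform prior."""
    prior = np.array([0.5, 0.5])
    seen = {tuple(np.round(prior, decimals))}
    counts = []
    for length in range(1, max_len + 1):
        for seq in itertools.product([0, 1], repeat=length):
            b = prior
            for obs in seq:
                b = update_belief(b, obs)
            seen.add(tuple(np.round(b, decimals)))
        counts.append(len(seen))
    return counts

print(count_distinct_beliefs(10))
# If the reachable belief set were finite, this count would plateau;
# for a nonunifilar parameterization like this one it keeps growing
# with the length of the observed history, so no finite-state machine
# can implement the optimal predictor exactly.
```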
Of course, that’s maybe not what you intend by ‘learn’. If by ‘learn’ you mean ‘express the underlying function of an HMM’, then the answer is yes by the Universal Approximation Theorem (a very fancy name for a trivial application of the Stone-Weierstrass theorem).
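For reference, one standard form of that statement (note it is about approximating a fixed, known function on a compact domain):

```latex
% One standard form of the Universal Approximation Theorem
% (Cybenko / Hornik / Leshno style), quoted here for reference.
For every continuous $f : K \to \mathbb{R}$ on a compact set $K \subset \mathbb{R}^n$,
every continuous non-polynomial activation $\sigma$, and every $\varepsilon > 0$,
there exist $N$ and weights $a_i, b_i \in \mathbb{R}$, $w_i \in \mathbb{R}^n$ such that
\[
  \sup_{x \in K} \Bigl| f(x) - \sum_{i=1}^{N} a_i \,\sigma(w_i^{\top} x + b_i) \Bigr| < \varepsilon .
\]
```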
Hope this helped. 😄
Where can I read about this 2-state HMM? By ‘learn’ I just mean approximate via an algorithm. The UAT is not sufficient, as it talks about learning a known function. Baum-Welch is such an algorithm, but as far as I am aware it gives no guarantees on anything, really.
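For concreteness, a minimal Baum-Welch sketch (EM for a discrete-emission HMM); the names and structure here are my own illustration, not a reference implementation. The only guarantee EM comes with is that each iteration does not decrease the data likelihood, so it converges to a local optimum that depends on the initialization:

```python
# Minimal Baum-Welch (EM) sketch for a discrete-emission HMM.
import numpy as np

def forward_backward(obs, pi, T, E):
    """Scaled forward-backward pass. Returns per-step scaling factors c,
    state posteriors gamma, and pairwise transition posteriors xi."""
    n, S = len(obs), len(pi)
    alpha = np.zeros((n, S)); beta = np.zeros((n, S)); c = np.zeros(n)

    alpha[0] = pi * E[:, obs[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ T) * E[:, obs[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]

    beta[-1] = 1.0
    for t in range(n - 2, -1, -1):
        beta[t] = (T @ (E[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]

    gamma = alpha * beta                      # gamma[t, s] = P(state s at t | obs)
    xi = np.zeros((n - 1, S, S))
    for t in range(n - 1):
        xi[t] = (alpha[t][:, None] * T
                 * (E[:, obs[t + 1]] * beta[t + 1])[None, :]) / c[t + 1]
    return c, gamma, xi

def baum_welch(obs, n_states, n_symbols, iters=50, seed=0):
    """Fit initial, transition, and emission probabilities by EM."""
    obs = np.asarray(obs)
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(n_states))
    T = rng.dirichlet(np.ones(n_states), size=n_states)
    E = rng.dirichlet(np.ones(n_symbols), size=n_states)
    for _ in range(iters):
        c, gamma, xi = forward_backward(obs, pi, T, E)
        pi = gamma[0]
        T = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        E = np.stack([gamma[obs == v].sum(axis=0) for v in range(n_symbols)], axis=1)
        E /= gamma.sum(axis=0)[:, None]
        # Log-likelihood under the parameters used in this E-step; EM
        # guarantees this is non-decreasing across iterations, and nothing more.
        loglik = np.log(c).sum()
    return pi, T, E, loglik

# Two different seeds on the same data will typically converge to different
# local optima -- the sense in which Baum-Welch "gives no guarantees".
```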