I don’t know what the samples have to do with it. I mean simply that the parameters of the model are initialized randomly and from the start all tokens will cause slightly different model behaviors, even if the tokens are never seen in the training data, and this will remain true no matter how long it’s trained.
The post shows 32 words being tested. The 16 positive valence words all have higher weight with normal tokenization, and the 16 negative valence words all have higher weight with unusual tokenization. This is extremely improbable by raw chance.
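For concreteness, here is roughly what the "raw chance" figure looks like under the assumption (which the reply below disputes) that each word's direction is an independent 50/50 draw under the null. This is just an illustrative sign-test sketch in Python, not anything from the original post:

```python
from math import comb

# Assumption: each word's direction is an independent coin flip under the null.
n = 32          # 16 positive- plus 16 negative-valence words tested
k = 32          # all 32 landed in the direction the hypothesis predicts
p_null = 0.5    # null hypothesis: tokenization has no systematic effect

# One-sided sign test: probability of at least k hits out of n by chance.
p_value = sum(comb(n, i) for i in range(k, n + 1)) * p_null ** n
print(p_value)  # 0.5 ** 32, roughly 2.3e-10
```

If the words' directions are correlated rather than independent, the effective number of trials shrinks and this number can be far too small, which is the point the reply makes.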
There’s no reason whatsoever to think that they are independent of each other. The very fact that you can classify them systematically as ‘positive’ or ‘negative’ valence indicates they are not, and you don’t know what ‘raw chance’ yields here. It might be quite probable.
Right, I think that’s the hypothesis I was asking whether you had when I said
If so, yeah, this is compatible. I’d still put notably higher odds on the original thing I suggested, but this is the other main hypothesis.