Cool work! Just to make sure I understand, your segment embeddings aren’t derived from the residual stream of the source model? It’s just the case that many sentences omitting a keyword X still represent X saliently?
That is correct. The segment embeddings are currently not derived from the residual stream, mainly because I lack the compute resources: I did this alone on my home laptop. They come from an external embedder instead. I hope to overcome this limitation, and I also want to check whether a range of external embedders gives different results. If I can secure compute resources, I plan to cache the internal embeddings as well.
So yes, the finding is that after the keyword is removed, the CoT sentences still encode what that keyword represents. This is admittedly a weaker claim for now, but I think it is somewhat more practical: the approach is purely black-box, so it can be deployed at the user level for closed frontier models.
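For readers who want a concrete picture, here is a minimal sketch of what such a black-box check could look like. It is an illustration under stated assumptions, not the actual pipeline: the function names (`mask_keyword`, `keyword_salience`, `toy_embed`) are hypothetical, cosine similarity to the keyword's own embedding is just one possible salience measure, and `toy_embed` is a hashed character-trigram stand-in that you would replace with a real external sentence embedder.

```python
import math
import re
import zlib
from typing import Callable, List, Sequence

Vector = List[float]


def mask_keyword(sentence: str, keyword: str) -> str:
    """Remove every whole-word, case-insensitive occurrence of `keyword`."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    # Collapse the double spaces left behind by the removal.
    return re.sub(r"\s+", " ", pattern.sub("", sentence)).strip()


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_salience(
    sentences: Sequence[str],
    keyword: str,
    embed: Callable[[str], Vector],
) -> List[float]:
    """Similarity between each keyword-free sentence and the keyword itself.

    `embed` is any external embedder; no access to model internals is needed.
    """
    target = embed(keyword)
    return [cosine(embed(mask_keyword(s, keyword)), target) for s in sentences]


def toy_embed(text: str, dim: int = 64) -> Vector:
    """Placeholder embedder (hashed character trigrams), NOT a real one."""
    vec = [0.0] * dim
    lowered = text.lower()
    for i in range(len(lowered) - 2):
        bucket = zlib.crc32(lowered[i:i + 3].encode("utf-8")) % dim
        vec[bucket] += 1.0
    return vec


if __name__ == "__main__":
    cot = [
        "The bridge failed because the steel cables fatigued.",
        "Inspect the span and the cables before reopening the bridge.",
    ]
    for sentence, score in zip(cot, keyword_salience(cot, "bridge", toy_embed)):
        print(f"{score:.3f}  {mask_keyword(sentence, 'bridge')}")
```

Because `embed` is passed in as a plain callable, swapping in several different external embedders and comparing the resulting scores only requires changing that one argument.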