Wait, it would be nice if there were no-op tokens:
tokens that, when appended or inserted into an LLM stream, have no effect on the probability of the following tokens.
(This would be useful because it would let you search the embedded trajectory space with gradient descent over non-fixed-length sequences, then use nearest-neighbor lookup to recover a tokenized version of that prompt.)
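A hedged sketch of that nearest-neighbor step, with a hypothetical embedding table `E` standing in for a real model's token embeddings: after gradient descent yields a continuous "soft prompt", each soft vector is snapped to the closest row of the embedding matrix to get discrete token ids.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, d = 100, 16
E = rng.normal(size=(vocab, d))        # hypothetical token-embedding table
soft_prompt = rng.normal(size=(3, d))  # continuous vectors found by gradient descent

# snap each soft vector to its nearest token embedding (Euclidean distance)
dists = np.linalg.norm(soft_prompt[:, None, :] - E[None, :, :], axis=-1)
token_ids = dists.argmin(axis=1)       # discrete "tokenized" version of the prompt
print(token_ids.shape)  # (3,)
```

The open problem the no-op tokens would solve is the "non-fixed length" part: the soft prompt above has a fixed length of 3, and gradient descent cannot shrink or grow it unless some tokens can be made to do nothing.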
I am not sure whether that is useful enough to be worth implementing.
I know you could do that by adding a per-token parameter that down-weights that token in the attention mechanism.
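The attention side of this is straightforward to check. A minimal NumPy sketch (not any particular model's implementation): if you append a candidate no-op token but add an additive mask of minus infinity on its key column, the softmax assigns it zero weight, and the outputs for the original tokens are unchanged.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    # scaled dot-product attention; mask adds -inf to blocked key columns
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if mask is not None:
        scores = scores + mask
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
d = 4
X = rng.normal(size=(3, d))        # three "real" token vectors
noop = rng.normal(size=(1, d))     # candidate no-op token

# baseline: attention over the real tokens only
base = attention(X, X, X)

# append the no-op token, but mask its key column so nothing attends to it
X2 = np.vstack([X, noop])
mask = np.zeros((4, 4))
mask[:, 3] = -np.inf               # no query attends to the no-op token
out = attention(X2, X2, X2, mask=mask)

# the first three output rows match the baseline: the token is invisible
print(np.allclose(out[:3], base))  # True
```

This only shows the attention layer is easy to neutralize; it deliberately leaves out positional encoding, which is exactly where the next problem lives.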
I am not sure how to disentangle the positional encoding (I need to study that mechanism).
I think it is impossible, because changing the positional encoding would break relative-position inferences.
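To make that concern concrete with the classic sinusoidal scheme (one common choice, not the only one): an inserted token occupies a position index, so every later token's positional encoding shifts even if attention ignores the no-op token itself.

```python
import numpy as np

def sinusoidal_pe(pos, d):
    # classic Transformer-style sinusoidal encoding for a single position
    i = np.arange(d // 2)
    angles = pos / (10000 ** (2 * i / d))
    return np.concatenate([np.sin(angles), np.cos(angles)])

d = 8
# encodings for positions 0..4
pes = np.stack([sinusoidal_pe(p, d) for p in range(5)])

# inserting a no-op token at position 2 pushes the token formerly at
# position 2 to position 3 -- its positional encoding changes
print(np.allclose(pes[2], pes[3]))  # False
```

So even a perfectly attention-masked token perturbs every downstream embedding through its position, which is why the positional side seems to be the real obstacle.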