FWIW, the “⟐” symbol is used by spiralists a lot (see: https://www.reddit.com/search?q=%E2%9F%90, or https://www.google.com/search?q=%22%E2%9F%90%22+spiral; most uses of the symbol on reddit are by spiralists). Mostly seems to be used as a header element, otherwise only vague connotations but maybe something about sealing or centering.
Hey Adele, Geodesic checking in here. We plan to just use a completely new token. We’ll have Aaron and his team create the data with something like [token] and then pass just this synthetic dataset through a new tokenizer. So our final model will have a vocabulary one token larger than our control’s, and that extra token is never seen in the original pre-training corpus.
Oh! Thanks for the heads up; we should use something else.
(If you have any suggestions, it’s been hard to find truly unladen symbols)
The ornamental dingbats seem pretty unladen and have some pretty symbols. There’s “🩍” which is maybe the best symbol for depicting a lightcone. The “Vulcan salute” (🖖) has some nice connotations.
Maybe it could be a random string instead of a symbol?
We ended up using the string “XXF”
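
For anyone following along, here is a rough sketch of the "vocabulary one token larger than the control" idea from earlier in the thread. It is a toy whitespace tokenizer, not the actual pipeline: the function names, the example sentences, and the use of "XXF" as the added token are all assumptions made for illustration.

```python
# Toy illustration only; not the tokenizer or data actually used.
NEW_TOKEN = "XXF"  # assumption: the string from the thread stands in for the new token


def build_vocab(corpus):
    """Map each distinct whitespace-separated word to an integer id."""
    vocab = {"<unk>": 0}
    for line in corpus:
        for word in line.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def extend_vocab(vocab, token):
    """Return a copy of vocab with exactly one extra entry appended."""
    if token in vocab:
        raise ValueError(f"{token!r} is already in the vocabulary")
    extended = dict(vocab)
    extended[token] = len(vocab)
    return extended


def encode(text, vocab):
    """Encode text as ids, falling back to <unk> for unknown words."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]


# Hypothetical stand-ins for the pre-training corpus and the synthetic dataset.
pretraining_corpus = ["the model reads text", "text is split into tokens"]
synthetic_data = ["the model sees XXF in text"]

control_vocab = build_vocab(pretraining_corpus)
new_vocab = extend_vocab(control_vocab, NEW_TOKEN)

# The new token must not occur anywhere in the original corpus.
assert all(NEW_TOKEN not in line.split() for line in pretraining_corpus)
```

With the control vocabulary the new string falls back to `<unk>`, while the extended vocabulary gives it its own id, so only the synthetic data ever exercises that id.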