Could it be generalizing from T E X T L I K E T H I S and/or mojibake UTF-16 interpreted as UTF-8 with every second character being zero? It’s still a bit more of a stretch from there to generalize to ignoring two intervening constant characters, though.
The component of ignoring two intervening characters is less mysterious to me. For example, a numbered list like “1. first_token 2. second_token …” would need this pattern. I am wondering mostly why the specific map from b’xa1′-b’xba’ to a-z is learned.
Could it be generalizing from T E X T L I K E T H I S and/or mojibake UTF-16 interpreted as UTF-8 with every second character being zero? It’s still a bit more of a stretch from there to generalize to ignoring two intervening constant characters, though.
The component of ignoring two intervening characters is less mysterious to me. For example, a numbered list like “1. first_token 2. second_token …” would need this pattern. I am wondering mostly why the specific map from b’xa1′-b’xba’ to a-z is learned.