I ran a few Ancient Greek strings through the Claude tokenizer for Opus 4.7 and Opus 4.6 to see whether Opus 4.7’s success might be explained by cleaner tokenization of Greek accents.
The short version: I don’t see evidence for that. The Greek segmentation looks basically the same. The only clear difference appears to be that Opus 4.7 tokenizes the fill-in-the-blank marker `___` as its own token, while Opus 4.6 splits it into `" _"` + `"__"`.
Test 1: first paragraph of the original fill-in-the-blank exercise
Input: Α ___ ἐστίν. Α καὶ Β ___ εἰσιν. Α, Β, καὶ Γ ___ Ἑλληνικὰ γράμματά εἰσιν. Καὶ Π ___ γράμμα ἐστίν, οὐ Λατινικόν. C ___ γράμμα ἐστίν, οὐχ Ἑλληνικόν.
Opus 4.7:
[+1 hidden]Α| |___|[+1 hidden] |[+1 hidden]ἐ|στ|ί|ν|.|[+1 hidden] Α| κα|[+1 hidden]ὶ|[+2 hidden] Β| |___|[+1 hidden] ε|[+1 hidden]ἰ|σ|ι|ν|.|[+1 hidden] Α|,|[+1 hidden] Β|,| κα|[+1 hidden]ὶ|[+2 hidden] Γ|[+1 hidden] |___|[+2 hidden] |[+1 hidden]Ἑ|λ|λη|ν|ικ|[+1 hidden]ὰ|[+1 hidden] γ|ρά|μ|ματ|ά| ε|[+1 hidden]ἰ|σ|ι|ν|.|[+1 hidden] Κα|[+1 hidden]ὶ|[+2 hidden] Π| |___|[+1 hidden] γ|ρά|μ|μ|α| |[+1 hidden]ἐ|στ|ί|ν|,| ο|[+1 hidden]ὐ|[+2 hidden] Λ|ατ|ιν|ικ|ό|ν|.| C| |___|[+1 hidden] γ|ρά|μ|μ|α| |[+1 hidden]ἐ|στ|ί|ν|,| ο|[+1 hidden]ὐ|χ|[+2 hidden] |[+1 hidden]Ἑ|λ|λη|ν|ικ|ό|ν|.
Opus 4.6:
[+1 hidden]Α| _|__| |[+1 hidden]ἐ|στ|ί|ν|.|[+1 hidden] Α| κα|[+1 hidden]ὶ|[+2 hidden] Β| _|__| ε|[+1 hidden]ἰ|σ|ι|ν|.|[+1 hidden] Α|,|[+1 hidden] Β|,| κα|[+1 hidden]ὶ|[+2 hidden] Γ|[+1 hidden] _|__|[+1 hidden] |[+1 hidden]Ἑ|λ|λη|ν|ικ|[+1 hidden]ὰ|[+1 hidden] γ|ρά|μ|ματ|ά| ε|[+1 hidden]ἰ|σ|ι|ν|.|[+1 hidden] Κα|[+1 hidden]ὶ|[+2 hidden] Π| _|__| γ|ρά|μ|μ|α| |[+1 hidden]ἐ|στ|ί|ν|,| ο|[+1 hidden]ὐ|[+2 hidden] Λ|ατ|ιν|ικ|ό|ν|.| C| _|__| γ|ρά|μ|μ|α| |[+1 hidden]ἐ|στ|ί|ν|,| ο|[+1 hidden]ὐ|χ|[+2 hidden] |[+1 hidden]Ἑ|λ|λη|ν|ικ|ό|ν|.
Main difference in Test 1:
Opus 4.7 has: | |___|[+1 hidden] |
Opus 4.6 has: | _|__| |
So 4.7 appears to see `___` as a cleaner standalone blank marker, while 4.6 splits it into a space-attached underscore token plus a double-underscore token.
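To locate that difference mechanically rather than by eye, here is a minimal Python sketch that strips the `[+N hidden]` markers, splits on `|`, and diffs the two piece lists with the standard library's `difflib`. The helper names (`visible_pieces`, `differing_spans`) are made up for illustration, and the input is only the first sentence of each output above.

```python
import re
from difflib import SequenceMatcher

# Matches the "[+N hidden]" markers printed by the tokenizer view.
HIDDEN = re.compile(r"\[\+\d+ hidden\]")

def visible_pieces(segmented: str) -> list:
    """Drop the hidden markers and split on the | separators."""
    return [p for p in HIDDEN.sub("", segmented).split("|") if p != ""]

def differing_spans(a: str, b: str) -> list:
    """Return (pieces_in_a, pieces_in_b) for every span where the two differ."""
    pa, pb = visible_pieces(a), visible_pieces(b)
    matcher = SequenceMatcher(None, pa, pb, autojunk=False)
    return [
        (pa[i1:i2], pb[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# First sentence of the Test 1 output for each model, copied from above.
opus_47 = "[+1 hidden]Α| |___|[+1 hidden] |[+1 hidden]ἐ|στ|ί|ν|."
opus_46 = "[+1 hidden]Α| _|__| |[+1 hidden]ἐ|στ|ί|ν|."

print(differing_spans(opus_47, opus_46))
# [([' ', '___'], [' _', '__'])]
```

On this sentence the only differing span is the blank marker; the Greek pieces line up one to one.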
But the Greek itself is basically the same. For example:
ἐστίν:
4.7 → |[+1 hidden]ἐ|στ|ί|ν|
4.6 → |[+1 hidden]ἐ|στ|ί|ν|
More examples
εἰσιν:
4.7:| ε|[+1 hidden]ἰ|σ|ι|ν|
4.6:| ε|[+1 hidden]ἰ|σ|ι|ν|
Ἑλληνικὰ:
4.7:|[+1 hidden]Ἑ|λ|λη|ν|ικ|[+1 hidden]ὰ|
4.6:|[+1 hidden]Ἑ|λ|λη|ν|ικ|[+1 hidden]ὰ|
γράμματά:
4.7:| γ|ρά|μ|ματ|ά|
4.6:| γ|ρά|μ|ματ|ά|
Λατινικόν:
4.7:|[+2 hidden] Λ|ατ|ιν|ικ|ό|ν|
4.6:|[+2 hidden] Λ|ατ|ιν|ικ|ό|ν|
οὐχ:
4.7:| ο|[+1 hidden]ὐ|χ|
4.6:| ο|[+1 hidden]ὐ|χ|
Test 2: acute vs grave before a following word
Input:
Ἑλληνικόν γράμμα
Ἑλληνικὸν γράμμα
Opus 4.7 → [+3 hidden]Ἑ|λ|λη|ν|ικ|ό|ν| γ|ρά|μ|μ|α|[+8 hidden] Ἑ|λ|λη|ν|ικ|[+1 hidden]ὸ|ν| γ|ρά|μ|μ|α
Opus 4.6 → [+3 hidden]Ἑ|λ|λη|ν|ικ|ό|ν| γ|ρά|μ|μ|α|[+4 hidden] Ἑ|λ|λη|ν|ικ|[+1 hidden]ὸ|ν| γ|ρά|μ|μ|α
Again, the visible Greek tokenization is the same:
Ἑλληνικόν → Ἑ|λ|λη|ν|ικ|ό|ν
Ἑλληνικὸν → Ἑ|λ|λη|ν|ικ|[+1 hidden]ὸ|ν
γράμμα → γ|ρά|μ|μ|α
The hidden counts differ around the line break / separation between examples, but the actual Greek pieces do not.
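That claim can be checked with a few lines of Python. This is a sketch under one assumption of mine: that each `[+N hidden]` marker stands for N tokens the tool does not display. `summarize` is a hypothetical helper that returns the visible pieces and the total hidden count for one output line.

```python
import re

# Captures N from each "[+N hidden]" marker.
HIDDEN = re.compile(r"\[\+(\d+) hidden\]")

def summarize(segmented: str):
    """Return (visible pieces, sum of hidden counts) for one tokenizer output."""
    hidden_total = sum(int(n) for n in HIDDEN.findall(segmented))
    pieces = [p for p in HIDDEN.sub("", segmented).split("|") if p != ""]
    return pieces, hidden_total

# Test 2 outputs, copied from above.
opus_47 = "[+3 hidden]Ἑ|λ|λη|ν|ικ|ό|ν| γ|ρά|μ|μ|α|[+8 hidden] Ἑ|λ|λη|ν|ικ|[+1 hidden]ὸ|ν| γ|ρά|μ|μ|α"
opus_46 = "[+3 hidden]Ἑ|λ|λη|ν|ικ|ό|ν| γ|ρά|μ|μ|α|[+4 hidden] Ἑ|λ|λη|ν|ικ|[+1 hidden]ὸ|ν| γ|ρά|μ|μ|α"

pieces_47, hidden_47 = summarize(opus_47)
pieces_46, hidden_46 = summarize(opus_46)

print(pieces_47 == pieces_46)  # True
print(hidden_47, hidden_46)    # 12 8
```

The visible pieces are identical, and the whole gap in hidden counts (12 vs 8) comes from the single marker between the two input lines.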
More acute/grave pairs
Λατινικόν/Λατινικὸν:
4.7:Λ|ατ|ιν|ικ|ό|ν vs Λ|ατ|ιν|ικ|[+1 hidden]ὸ|ν
4.6:Λ|ατ|ιν|ικ|ό|ν vs Λ|ατ|ιν|ικ|[+1 hidden]ὸ|ν
ἀρσενικόν/ἀρσενικὸν:
4.7:ἀ|ρ|σ|εν|ικ|ό|ν vs ἀ|ρ|σ|εν|ικ|[+1 hidden]ὸ|ν
4.6:ἀ|ρ|σ|εν|ικ|ό|ν vs ἀ|ρ|σ|εν|ικ|[+1 hidden]ὸ|ν
θηλυκόν/θηλυκὸν:
4.7:θ|η|λ|υ|κ|ό|ν vs θ|η|λ|υ|κ|[+1 hidden]ὸ|ν
4.6:θ|η|λ|υ|κ|ό|ν vs θ|η|λ|υ|κ|[+1 hidden]ὸ|ν
μικρόν/μικρὸν:
4.7:μ|ικ|ρ|ό|ν vs μ|ικ|ρ|[+1 hidden]ὸ|ν
4.6:μ|ικ|ρ|ό|ν vs μ|ικ|ρ|[+1 hidden]ὸ|ν
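In every pair above, the grave form carries a `[+1 hidden]` marker and the acute form does not. One possible explanation, which is a guess on my part and not something the tokenizer output confirms: in NFC-normalized text the acute vowels sit in the basic Greek block (2 bytes in UTF-8), while the grave vowels exist only in the Greek Extended block (3 bytes), so a byte-level tokenizer may need an extra piece for them. The Python sketch below checks only the Unicode side of that guess, and it assumes the pasted strings were NFC-normalized; `describe` is a hypothetical helper.

```python
import unicodedata

def describe(ch: str):
    """Return (code point, Unicode name, UTF-8 byte length) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), len(ch.encode("utf-8")))

# Compose omicron with a combining acute / combining grave accent under NFC.
acute = unicodedata.normalize("NFC", "\u03bf\u0301")
grave = unicodedata.normalize("NFC", "\u03bf\u0300")

print(describe(acute))
# ('U+03CC', 'GREEK SMALL LETTER OMICRON WITH TONOS', 2)
print(describe(grave))
# ('U+1F78', 'GREEK SMALL LETTER OMICRON WITH VARIA', 3)
```

If this reading is right, the extra hidden piece reflects how the character is encoded, and it shows up identically for both models.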
Test 3: correct vs incorrect enclitic accent placement
Input:
γράμμα ἐστίν
γράμμά ἐστιν
Opus 4.7 → γ|ρά|μ|μ|α| |[+1 hidden]ἐ|στ|ί|ν|[+2 hidden] γ|ρά|μ|μ|ά| |[+1 hidden]ἐ|στ|ι|ν
Opus 4.6 → γ|ρά|μ|μ|α| |[+1 hidden]ἐ|στ|ί|ν|[+2 hidden] γ|ρά|μ|μ|ά| |[+1 hidden]ἐ|στ|ι|ν
No visible difference.
More examples from the same test
γράμματα εἰσίν:
4.7:γ|ρά|μ|ματ|α| ε|[+1 hidden]ἰ|σ|ί|ν
4.6:γ|ρά|μ|ματ|α| ε|[+1 hidden]ἰ|σ|ί|ν
γράμματά εἰσιν:
4.7:γ|ρά|μ|ματ|ά| ε|[+1 hidden]ἰ|σ|ι|ν
4.6:γ|ρά|μ|ματ|ά| ε|[+1 hidden]ἰ|σ|ι|ν
σύμφωνον ἐστίν:
4.7:σ|ύ|μ|φ|ω|ν|ον| |[+1 hidden]ἐ|στ|ί|ν
4.6:σ|ύ|μ|φ|ω|ν|ον| |[+1 hidden]ἐ|στ|ί|ν
σύμφωνόν ἐστιν:
4.7:σ|ύ|μ|φ|ω|ν|ό|ν| |[+1 hidden]ἐ|στ|ι|ν
4.6:σ|ύ|μ|φ|ω|ν|ό|ν| |[+1 hidden]ἐ|στ|ι|ν
δίφθογγος ἐστίν:
4.7:δ|ί|φ|θ|ογ|γ|ος| |[+1 hidden]ἐ|στ|ί|ν
4.6:δ|ί|φ|θ|ογ|γ|ος| |[+1 hidden]ἐ|στ|ί|ν
δίφθογγός ἐστιν:
4.7:δ|ί|φ|θ|ογ|γ|ός| |[+1 hidden]ἐ|στ|ι|ν
4.6:δ|ί|φ|θ|ογ|γ|ός| |[+1 hidden]ἐ|στ|ι|ν
So it appears unlikely that Opus 4.7 solved the exercise because Greek accents are tokenized more cleanly. Accent-bearing Greek characters are still split into the same small pieces by both tokenizers.
Note: I am not able to read Ancient Greek, so please point out if any of my examples are incorrect.