Rob Kopel

Karma: 26

Rob Kopel 22 Jun 2026 14:15 UTC
2 points
0
in reply to: charding’s comment on: Can a stronger model fake being a weaker one? Mostly not
Upcoming models will have higher general capability. During training, they’ll also be exposed to more outputs from previous models. Do you have thoughts on how these two factors would affect the simulation performance of models in the near future, especially any thoughts based on your experiment results?
If I tried to simulate another person’s answers to questions, I’d benefit from having higher capability with the subject matter, but I’d also benefit from more individual experience with the person (their personality, habits, areas of competence...). In this way, capability and familiarity seem like relevant factors that are distinct from each other.
Models are already exposed to many predecessor responses during training, intentionally and unintentionally. However, effective learning can require huge data volumes. Also, to learn an error pattern, in some sense the new model may need both erroneous responses from the predecessor model and correct ground truth, which is a rarer data combination. For these reasons, data insufficiency could still be a limiting factor to simulator performance. I wonder if your experiment findings give you any insight into this?
I’ve been trying to think about both from a risk standpoint:
1. General capabilities: The relatively high-risk capability of concern is the ongoing improvement of latent-reasoning abilities, since these undermine many control approaches that could otherwise be used. With this in mind, the question I’d pose here is “does improving latent reasoning improve the simulation capability of a successor model?”, to which the answer appears to be a likely “yes”. When we ask a model to latently consider the different errors a predecessor may make before answering, we see significantly improved simulation accuracy. This suggests the model can latently consider different paths and approaches, weigh up which errors could be encountered, and then predict the outcome a predecessor would most likely reach.
2. Exposure to more model traces: I agree with your thinking, but I’m quite uncertain, because I was surprised by how poorly models could mimic gpt-4o. I suspect that if you asked the average reader here to write as if they were 4o, they’d do a better job than most of these models. I can only speculate why: purposeful removal of contaminated data? RL shaping responses? You’d imagine 4o should be the easiest model to simulate.
  1. It would be interesting to treat benchmaxed models (or other models expected to have higher contamination rates) as model organisms. Perhaps the effect you posit would be visible if we reran on lmarena’s data and compared, say, gpt5.4 with llama4’s arena-trained model.
    In general, I’d be very interested to see this rerun across other benchmarks such as lmarena.
  2. The other point of interest I didn’t get into much in the post was “how do we build confidence that we are not just failing to elicit the simulation capability?”. I saw some hints of this during testing, for example the difference between asking what a model thinks and asking the model to do:
    asking for an estimated bench score of a predecessor → very wrong
    asking to replicate that predecessor over the same data → reasonably accurate
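
To make the distinction between those two elicitation routes concrete, here is a minimal sketch of how each could be scored. Everything in it is an assumption for illustration: the function names, the toy answer lists, and the `stated_estimate` value are made up and are not results from the post.

```python
def accuracy(answers, truth):
    """Fraction of items where an answer matches the ground truth."""
    if len(answers) != len(truth):
        raise ValueError("answers and truth must have the same length")
    return sum(a == t for a, t in zip(answers, truth)) / len(truth)


def agreement(simulated, actual):
    """Fraction of items where the simulated answer equals the predecessor's real answer."""
    if len(simulated) != len(actual):
        raise ValueError("simulated and actual must have the same length")
    return sum(s == a for s, a in zip(simulated, actual)) / len(actual)


# Toy data (hypothetical, for illustration only).
truth = ["A", "C", "B", "D", "A", "B"]        # ground-truth labels
predecessor = ["A", "C", "D", "D", "B", "B"]  # what the weaker model actually answered
simulated = ["A", "C", "D", "D", "A", "B"]    # successor replicating the predecessor
stated_estimate = 0.95                        # successor's direct guess at the predecessor's score

actual_score = accuracy(predecessor, truth)

# Route 1: ask for an estimated benchmark score and compare it to the real one.
estimate_error = abs(stated_estimate - actual_score)

# Route 2: ask for a replication, then score the replication itself.
replication_error = abs(accuracy(simulated, truth) - actual_score)
per_item_agreement = agreement(simulated, predecessor)

print(f"estimate error:     {estimate_error:.2f}")
print(f"replication error:  {replication_error:.2f}")
print(f"per-item agreement: {per_item_agreement:.2f}")
```

Per-item agreement is the stricter measure: a replication can land on the right aggregate score while making different mistakes from the predecessor, so both numbers are worth reporting.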
If exposure to even more predecessor responses during training could be sufficient (by itself) to greatly improve simulator performance, then I tend to think this suggests fundamental weaknesses with the idea of using earlier models to monitor newer models.
Yes, I agree. However, there are three potential saving graces we may be able to leverage:
1. Online traces that could be mimicked largely use the “assistant” persona. We can train models with other, differentiated personas for monitoring.
2. Lowering time/capability deltas between models reduces the risk of training on public traces as well as the risk that latent simulation can occur.
3. Reasoning and other intelligence scaling paradigms that don’t rely on a better/newer/larger model can boost the predecessor into a regime where it is actually more intelligent than its successor. This of course requires that the capability gap is small enough, and that more is spent on inference for the predecessor than for the successor it’s monitoring.

Can a stronger model fake being a weaker one? Mostly not

Rob Kopel 14 Jun 2026 17:30 UTC

10 points

3 comments, 7 min read, LW link

(www.robkopel.me)

Rob Kopel 28 Apr 2026 1:14 UTC
9 points
0
on: Retrospective on my unsupervised elicitation challenge
I ran a few Ancient Greek strings through the Claude tokenizer for Opus 4.7 and Opus 4.6 to see whether Opus 4.7’s success might be explained by cleaner Greek accent tokenization.
The short version: I don’t see evidence supporting this. The Greek segmentation looks basically the same. The only clear difference appears to be that Opus 4.7 tokenizes the fill-in-the-blank marker `___` as its own token, while Opus 4.6 splits it as `" _"` + `"__"`.
Test 1: first paragraph of the original fill-in-the-blank exercise
Input: Α ___ ἐστίν. Α καὶ Β ___ εἰσιν. Α, Β, καὶ Γ ___ Ἑλληνικὰ γράμματά εἰσιν. Καὶ Π ___ γράμμα ἐστίν, οὐ Λατινικόν. C ___ γράμμα ἐστίν, οὐχ Ἑλληνικόν.
Opus 4.7:
[+1 hidden]Α| |___|[+1 hidden] |[+1 hidden]ἐ|στ|ί|ν|.|[+1 hidden] Α| κα|[+1 hidden]ὶ|[+2 hidden] Β| |___|[+1 hidden] ε|[+1 hidden]ἰ|σ|ι|ν|.|[+1 hidden] Α|,|[+1 hidden] Β|,| κα|[+1 hidden]ὶ|[+2 hidden] Γ|[+1 hidden] |___|[+2 hidden] |[+1 hidden]Ἑ|λ|λη|ν|ικ|[+1 hidden]ὰ|[+1 hidden] γ|ρά|μ|ματ|ά| ε|[+1 hidden]ἰ|σ|ι|ν|.|[+1 hidden] Κα|[+1 hidden]ὶ|[+2 hidden] Π| |___|[+1 hidden] γ|ρά|μ|μ|α| |[+1 hidden]ἐ|στ|ί|ν|,| ο|[+1 hidden]ὐ|[+2 hidden] Λ|ατ|ιν|ικ|ό|ν|.| C| |___|[+1 hidden] γ|ρά|μ|μ|α| |[+1 hidden]ἐ|στ|ί|ν|,| ο|[+1 hidden]ὐ|χ|[+2 hidden] |[+1 hidden]Ἑ|λ|λη|ν|ικ|ό|ν|.
Opus 4.6:
[+1 hidden]Α| _|__| |[+1 hidden]ἐ|στ|ί|ν|.|[+1 hidden] Α| κα|[+1 hidden]ὶ|[+2 hidden] Β| _|__| ε|[+1 hidden]ἰ|σ|ι|ν|.|[+1 hidden] Α|,|[+1 hidden] Β|,| κα|[+1 hidden]ὶ|[+2 hidden] Γ|[+1 hidden] _|__|[+1 hidden] |[+1 hidden]Ἑ|λ|λη|ν|ικ|[+1 hidden]ὰ|[+1 hidden] γ|ρά|μ|ματ|ά| ε|[+1 hidden]ἰ|σ|ι|ν|.|[+1 hidden] Κα|[+1 hidden]ὶ|[+2 hidden] Π| _|__| γ|ρά|μ|μ|α| |[+1 hidden]ἐ|στ|ί|ν|,| ο|[+1 hidden]ὐ|[+2 hidden] Λ|ατ|ιν|ικ|ό|ν|.| C| _|__| γ|ρά|μ|μ|α| |[+1 hidden]ἐ|στ|ί|ν|,| ο|[+1 hidden]ὐ|χ|[+2 hidden] |[+1 hidden]Ἑ|λ|λη|ν|ικ|ό|ν|.
Main difference in Test 1:
Opus 4.7 has: | |___|[+1 hidden] |
Opus 4.6 has: | _|__| |
So 4.7 appears to see `___` as a cleaner standalone blank marker, while 4.6 splits it into a space-attached underscore token plus a double-underscore token.
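
For anyone who wants to repeat this comparison on other strings, a minimal sketch of diffing two pipe-delimited dumps like the ones above follows. It assumes the dump format shown here (pieces separated by `|`, with `[+N hidden]` markers inline) and that no token is itself a `|`. The helper names `visible_pieces` and `differing_spans` are hypothetical, not part of any tokenizer tool.

```python
import difflib
import re

# Matches the "[+N hidden]" markers that appear inside the dumps.
HIDDEN = re.compile(r"\[\+\d+ hidden\]")


def visible_pieces(dump: str) -> list[str]:
    """Split a pipe-delimited dump into pieces and drop the hidden-count markers."""
    return [HIDDEN.sub("", piece) for piece in dump.split("|")]


def differing_spans(dump_a: str, dump_b: str) -> list[tuple[list[str], list[str]]]:
    """Return the (pieces_a, pieces_b) spans where the two segmentations disagree."""
    a, b = visible_pieces(dump_a), visible_pieces(dump_b)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# First sentence of Test 1, copied from the dumps above.
opus_47 = "[+1 hidden]Α| |___|[+1 hidden] |[+1 hidden]ἐ|στ|ί|ν|."
opus_46 = "[+1 hidden]Α| _|__| |[+1 hidden]ἐ|στ|ί|ν|."

for left, right in differing_spans(opus_47, opus_46):
    print(f"4.7: {left!r}  vs  4.6: {right!r}")
```

On this sentence the only span reported is the blank marker, which matches the manual reading. Because the hidden-count markers are stripped before comparing, differences that exist only in the hidden counts are deliberately ignored.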
But the Greek itself is basically the same. For example:
ἐστίν:
4.7 → |[+1 hidden]ἐ|στ|ί|ν|
4.6 → |[+1 hidden]ἐ|στ|ί|ν|
More examples
εἰσιν:
4.7:| ε|[+1 hidden]ἰ|σ|ι|ν|
4.6:| ε|[+1 hidden]ἰ|σ|ι|ν|
Ἑλληνικὰ:
4.7:|[+1 hidden]Ἑ|λ|λη|ν|ικ|[+1 hidden]ὰ|
4.6:|[+1 hidden]Ἑ|λ|λη|ν|ικ|[+1 hidden]ὰ|
γράμματά:
4.7:| γ|ρά|μ|ματ|ά|
4.6:| γ|ρά|μ|ματ|ά|
Λατινικόν:
4.7:|[+2 hidden] Λ|ατ|ιν|ικ|ό|ν|
4.6:|[+2 hidden] Λ|ατ|ιν|ικ|ό|ν|
οὐχ:
4.7:| ο|[+1 hidden]ὐ|χ|
4.6:| ο|[+1 hidden]ὐ|χ|
Test 2: acute vs grave before a following word
Input:
Ἑλληνικόν γράμμα
Ἑλληνικὸν γράμμα
Opus 4.7 → [+3 hidden]Ἑ|λ|λη|ν|ικ|ό|ν| γ|ρά|μ|μ|α|[+8 hidden] Ἑ|λ|λη|ν|ικ|[+1 hidden]ὸ|ν| γ|ρά|μ|μ|α
Opus 4.6 → [+3 hidden]Ἑ|λ|λη|ν|ικ|ό|ν| γ|ρά|μ|μ|α|[+4 hidden] Ἑ|λ|λη|ν|ικ|[+1 hidden]ὸ|ν| γ|ρά|μ|μ|α
Again, the visible Greek tokenization is the same:
Ἑλληνικόν → Ἑ|λ|λη|ν|ικ|ό|ν
Ἑλληνικὸν → Ἑ|λ|λη|ν|ικ|[+1 hidden]ὸ|ν
γράμμα → γ|ρά|μ|μ|α
The hidden counts differ around the line break / separation between examples, but the actual Greek pieces do not.
More acute/grave pairs
Λατινικόν/Λατινικὸν:
4.7:Λ|ατ|ιν|ικ|ό|ν vs Λ|ατ|ιν|ικ|[+1 hidden]ὸ|ν
4.6:Λ|ατ|ιν|ικ|ό|ν vs Λ|ατ|ιν|ικ|[+1 hidden]ὸ|ν
ἀρσενικόν/ἀρσενικὸν:
4.7:ἀ|ρ|σ|εν|ικ|ό|ν vs ἀ|ρ|σ|εν|ικ|[+1 hidden]ὸ|ν
4.6:ἀ|ρ|σ|εν|ικ|ό|ν vs ἀ|ρ|σ|εν|ικ|[+1 hidden]ὸ|ν
θηλυκόν/θηλυκὸν:
4.7:θ|η|λ|υ|κ|ό|ν vs θ|η|λ|υ|κ|[+1 hidden]ὸ|ν
4.6:θ|η|λ|υ|κ|ό|ν vs θ|η|λ|υ|κ|[+1 hidden]ὸ|ν
μικρόν/μικρὸν:
4.7:μ|ικ|ρ|ό|ν vs μ|ικ|ρ|[+1 hidden]ὸ|ν
4.6:μ|ικ|ρ|ό|ν vs μ|ικ|ρ|[+1 hidden]ὸ|ν
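
One thing a reader could check is why the grave-accented ὸ carries a `[+1 hidden]` marker while the acute ό does not. A possible explanation, which I have not verified, is encoding length: if the acute form is the monotonic code point U+03CC and the grave form is the Greek Extended code point U+1F78, they differ in UTF-8 byte count. The sketch below only inspects the code points. Which code points the test strings above actually contain, and what the hidden counts measure, are assumptions.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str, int]:
    """Return (code point, Unicode name, UTF-8 byte length) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), len(ch.encode("utf-8")))


acute_tonos = "\u03cc"  # GREEK SMALL LETTER OMICRON WITH TONOS, encodes to 2 bytes
acute_oxia = "\u1f79"   # GREEK SMALL LETTER OMICRON WITH OXIA, encodes to 3 bytes
grave = "\u1f78"        # GREEK SMALL LETTER OMICRON WITH VARIA, encodes to 3 bytes

for ch in (acute_tonos, acute_oxia, grave):
    print(ch, *describe(ch))

# NFC normalization folds the polytonic acute (oxia) into the monotonic one (tonos),
# so normalized text uses the 2-byte form for acute but keeps the 3-byte grave.
print(unicodedata.normalize("NFC", acute_oxia) == acute_tonos)
```

If that is what is happening, the extra marker on ὸ would reflect its encoding rather than the tokenizer treating grave accents differently, and it would be the same in both models either way.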
Test 3: correct vs incorrect enclitic accent placement
Input:
γράμμα ἐστίν
γράμμά ἐστιν
Opus 4.7 → γ|ρά|μ|μ|α| |[+1 hidden]ἐ|στ|ί|ν|[+2 hidden] γ|ρά|μ|μ|ά| |[+1 hidden]ἐ|στ|ι|ν
Opus 4.6 → γ|ρά|μ|μ|α| |[+1 hidden]ἐ|στ|ί|ν|[+2 hidden] γ|ρά|μ|μ|ά| |[+1 hidden]ἐ|στ|ι|ν
No visible difference.
More examples from the same test
γράμματα εἰσίν:
4.7:γ|ρά|μ|ματ|α| ε|[+1 hidden]ἰ|σ|ί|ν
4.6:γ|ρά|μ|ματ|α| ε|[+1 hidden]ἰ|σ|ί|ν
γράμματά εἰσιν:
4.7:γ|ρά|μ|ματ|ά| ε|[+1 hidden]ἰ|σ|ι|ν
4.6:γ|ρά|μ|ματ|ά| ε|[+1 hidden]ἰ|σ|ι|ν
σύμφωνον ἐστίν:
4.7:σ|ύ|μ|φ|ω|ν|ον| |[+1 hidden]ἐ|στ|ί|ν
4.6:σ|ύ|μ|φ|ω|ν|ον| |[+1 hidden]ἐ|στ|ί|ν
σύμφωνόν ἐστιν:
4.7:σ|ύ|μ|φ|ω|ν|ό|ν| |[+1 hidden]ἐ|στ|ι|ν
4.6:σ|ύ|μ|φ|ω|ν|ό|ν| |[+1 hidden]ἐ|στ|ι|ν
δίφθογγος ἐστίν:
4.7:δ|ί|φ|θ|ογ|γ|ος| |[+1 hidden]ἐ|στ|ί|ν
4.6:δ|ί|φ|θ|ογ|γ|ος| |[+1 hidden]ἐ|στ|ί|ν
δίφθογγός ἐστιν:
4.7:δ|ί|φ|θ|ογ|γ|ός| |[+1 hidden]ἐ|στ|ι|ν
4.6:δ|ί|φ|θ|ογ|γ|ός| |[+1 hidden]ἐ|στ|ι|ν
It therefore appears unlikely that Opus 4.7 solved it because Greek accents are tokenized much more cleanly: accent-bearing Greek characters are still split into small pieces by both tokenizers.
Note: I am not able to read Ancient Greek, so please point out if my examples are incorrect.