Cool explanation! I think this is probably post-hoc analysis; some experiments that might be stronger evidence:
- Reduce the amount of potentially extraneous information you’re giving it: „In the sentence below, what does marinade mean?“ —Prompt it or a recent model for completions in a way that reveals the internal definition: „Our script prints the returned ‘score’ float. Ok. Now we need to improve ‘solve_code_contests_rust.py’. We also should implement ‘Backoff’ and ‘content filtering’ to ensure stable. But we must ensure whichever we produce is within time. Now we also must ensure marinade, meaning…“ - Something something finding probes that fire on marinade and seeing what they’re correlated to or other interp, not necessarily on GPT-5 but on similarly trained models
I don’t have access to paid versions of GPT-5, so let me know if you do one of these and find something!
Cool explanation! I think this is probably post-hoc analysis; some experiments that might be stronger evidence:
- Reduce the amount of potentially extraneous information you’re giving it: „In the sentence below, what does marinade mean?“
—Prompt it or a recent model for completions in a way that reveals the internal definition: „Our script prints the returned ‘score’ float. Ok. Now we need to improve ‘solve_code_contests_rust.py’. We also should implement ‘Backoff’ and ‘content filtering’ to ensure stable. But we must ensure whichever we produce is within time. Now we also must ensure marinade, meaning…“
- Something something finding probes that fire on marinade and seeing what they’re correlated to or other interp, not necessarily on GPT-5 but on similarly trained models
I don’t have access to paid versions of GPT-5, so let me know if you do one of these and find something!