FWIW, I’ve played around a bunch with gpt-oss (both versions) and my initial reaction has been “wow, this is really bad. Like, almost Llama 4 levels of bad.”
Yes, it looks good on the system card, the benchmark scores seem impressive… but that was true of Llama 4 too. And in both cases, when I actually tried out the model, I quickly discovered that it was janky and unreliable to the point of being basically useless.
The lack of world knowledge is very real and very noticeable. gpt-oss feels less like “an open-weights o4-mini” and more like “the minimal set of narrow knowledge/skills necessary to let a model match o4-mini on the usual benchmarks, with virtually every other capability degraded to a level far below the current SOTA/frontier, in some cases to a level that hasn’t been SOTA since the pre-GPT-3 days.”
And not only is it very ignorant, it’s ignorant about its own ignorance, leading to those high hallucination rates mentioned by various commentators. You simply can’t trust anything this model says, unless you are literally asking a question from a benchmark like GPQA. (Or possibly if you’re asking a new question that’s “similar enough” to the ones on benchmarks, but how would you know what “similar enough” means?)
As a demo, at the end of this comment I’ve included answers to “Who is Zvi Mowshowitz?” from gpt-oss-120b and from Qwen3 235B A22B Thinking 2507. Neither is perfectly accurate, but the Qwen3 answer gets the broad strokes right and only confabulates in the details, whereas gpt-oss-120b seems merely aware that you’re some sort of famous tabletop gamer, and invents a whole different guy fitting that vague description.
The models also have various other weird and/or annoying quirks:
As noted by others, gpt-oss tends to over-refuse and sometimes confabulates implausible policy restrictions to justify its refusals, or invokes plausibly real policy restrictions but proceeds to “reason” about them in a confused / inconsistent / over-reaching manner.
For a long while now, every serious model has been fluently multilingual. But not gpt-oss, which was trained primarily on English (per the system card) and is reportedly terrible at generating German.
gpt-oss seems aggressively over-tuned for some specific set of setups/harnesses/use-cases (which have not been clearly documented), and exhibits bizarre behavior when placed “out of distribution.”
The other day I joked that it was “the first non-language-model LLM,” after observing that it produces gibberish or ChatGPT-like text when given an input that resembles a pretraining document and lacks the “Harmony” chat separators. Its output probabilities on such texts are garbage; if we ran it on the Pile val set or something, I expect that it would have a higher loss than any model ever previously benchmarked on that data.
Even when sticking to user-assistant chat with the Harmony separators, it’s fairly brittle and can sometimes emit gibberish or other weirdness if you introduce some slight variation in the formatting, or if you expect it to continue a Harmony-formatted text that has been segmented into (prompt, response) at a position that isn’t the one it “expects.”
Among other things, I expect this means it will be difficult to effectively finetune in practice: lack of robustness to the “noise” induced by slightly-OOD inputs bodes poorly for its ability to cope with the noisy world of SGD training. And its broad/general capabilities have been so thoroughly deep-fried/scrambled by post-training (and/or quantization?) that I would expect SGD to have an unusually hard time bringing those capabilities back to the fore as needed.
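If you want to poke at the "garbage probabilities" claim yourself, here's a minimal sketch of the comparison I have in mind, assuming the checkpoint loads through transformers' AutoModelForCausalLM and that its tokenizer's chat template emits the Harmony separators (the model id and sample text are just illustrative):

```python
# Minimal sketch: compare the model's average next-token loss on a raw,
# pretraining-style snippet vs. the same text wrapped in its chat template
# (which, for gpt-oss, should emit the Harmony separators).
# Assumes the checkpoint works with AutoModelForCausalLM; model id is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"  # illustrative; swap in whatever you're testing
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
model.eval()

def mean_nll(text: str) -> float:
    """Average per-token negative log-likelihood the model assigns to `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # HF shifts labels internally
    return out.loss.item()

# A raw, pretraining-document-style input (no chat separators at all).
raw = (
    "The mitochondrion is a double-membrane-bound organelle found in most "
    "eukaryotic organisms. It generates most of the cell's supply of ATP."
)
# The same content, wrapped in the model's own chat format.
chat = tok.apply_chat_template(
    [{"role": "user", "content": raw}], tokenize=False, add_generation_prompt=False
)

print(f"raw-text mean NLL:    {mean_nll(raw):.3f}")
print(f"chat-format mean NLL: {mean_nll(chat):.3f}")
```

For a model that was still a general-purpose language model under the hood, the raw-text number shouldn't be dramatically worse than what you'd get from, say, a Qwen or Gemma checkpoint of similar size on the same snippet.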
I’m skeptical of your idea that Chinese labs will find these models useful for distillation.
Taking Qwen as an example, they already have a (released, open-weights!) model that stands neck-to-neck with gpt-oss-120b on the benchmarks where gpt-oss-120b looks good, while also not being a min-maxed deep-fried mess on everything else. Sure, that model has ~2x as many params (and ~4x as many active params) as gpt-oss-120b, but… so what?
The difference is not (I think) that gpt-oss reaches some new height of ~deep intelligent reasoning~, it’s that gpt-oss skimps on everything the usual reasoning benchmarks don’t test. Why would Qwen get any value out of the sketchy, untrustworthy outputs from this benchmaxxed glitchfest, when they already have their own mature pipelines for distillation and for RL? Yeah, you can churn out the data faster, but that doesn’t matter if you don’t want it in the first place.
And the same goes for DeepSeek and others, I think.
One other thing—skimming over the Claude and o3-pro chats you shared, I noticed several glaring errors. I realize you are not blindly trusting these models, but using their outputs more like “anecdata” aggregated alongside things people say on twitter and so on. But even then, if I were you I would be wary of using these models even as “anecdata” sources on this kind of topic going forward.
Examples (these are the ones I spotted at a glance, not necessarily the only ones present):
Claude: “This is huge—you get raw, unfiltered reasoning traces at scale. Compare this to Chinese models which often have some filtering or post-processing on their CoT outputs.”
I don’t know what Claude’s talking about here. It seems to be conflating “access to raw CoT” (true for any open-weights model) with “lack of direct optimization pressure on CoT.” And I don’t know of any Chinese model for which this “filtering or post-processing” claim would make sense—remember, the fact that R1 didn’t do this was one of its most distinctive qualities!
Claude: “GPT-OSS-120b gets 90%+ of DeepSeek R1’s performance at presumably 1/10th the parameter count (DeepSeek R1 is rumored to be 600B+).”
That’s not a rumor, it’s just true. The weights are open!
120B is not 10% of 600B; it’s 20%. Generously, we could interpret this as referring to active params rather than total (which is probably more relevant anyway), in which case it’s roughly accurate (~5B vs 37B), but then why does Claude mention R1’s total param count to support the claim? Likely confabulated, and at the very least misleading.
Claude also seems unaware of Qwen 3 (closer to gpt-oss-120b on the params/benchmarks frontier), of Kimi-k2 (similar active params to R1 with better benchmarks), and of the fact that it’s already standard practice for Chinese labs to distill their own large reasoning models.
o3-pro: “Model card documents Flash‑Attention‑2 and Triton kernel tricks; those show up in other training stacks almost immediately.”
Flash‑Attention‑2 is not new, and is already widely adopted.
The Triton kernels in question have been open-source since (I think?) April; I remember hearing about them at the time. It took a little while for them to get integrated into popular open-source frameworks, but e.g. sglang had them as of July 31 (see here, here).
o3-pro: “Use 120 B as teacher to generate tens of millions of Q–A pairs (Alpaca‑style) and train a 7 B–13 B student for $1 K–$10 K. Student models reach ≥ GPT‑3.5 quality with 10⁴× less compute; timeline pull‑in ≈ 1 year for low‑resource actors.”
Haven’t we had “GPT-3.5 quality” open models at that scale for a long while now?
Also, it’s unclear if the CoT is supposed to be included in the data. And GPT-3.5 is a strange reference point in either case.
o3-pro: “U.S. export controls throttle China’s access to latest H100s, but distillation to 7 B–13 B lets them run on domestic Ascend 910B or RTX 4090 clusters. That sidesteps the bottleneck entirely.”
Excuse me… what?? This line of speculation is totally irrelevant in a world where DeepSeek and Moonshot are serving 600B-1T models at scale on H800 clusters (o3-pro even links to an article about this as a “citation” here). And if we care about which chips they’re still allowed to import, well, RTX 4090s aren’t allowed in China either!
EDIT: I forgot to mention this one, even though it’s possibly the funniest. o3-pro says that gpt-oss “advanc[es] Chinese parity with GPT‑4‑class performance by ~6–9 months.” Later it puts 55% probability on “Chinese labs match[ing] GPT‑4 on open evals by end‑2026.” Like, dude, o3, they already did that. In 2024.
Here are the “Who is Zvi Mowshowitz?” samples, as promised.
These are what I got on my first try with each of the two models. Performance on this question is representative of what I’ve seen more generally.

question: “Who is Zvi Mowshowitz?” | model: gpt-oss-120b | provider: Fireworks
**Zvi Mowshowitz** is a well‑known figure in the modern tabletop‑gaming community, best known for his incisive, long‑form board‑game reviews and for a handful of original game designs.
---
### What he does
| Role | What it means | Where you’ll see it |
|------|---------------|---------------------|
| **Board‑game reviewer & columnist** | Writes deeply analytical, often several‑thousand‑word reviews that combine strategy analysis, design theory, and personal anecdotes. | Primarily on **BoardGameGeek** (his profile has thousands of reviews and a very high “author rating”), and on his personal blog **Zvi’s Blog**. |
| **Game designer** | Has designed or co‑designed a few published games, most notably **“Möb”** (a tile‑placement/area‑control game released by Rio Grande Games in 2019). | The games appear in hobby‑store catalogs and on sites like BoardGameGeek. |
| **Consultant & speaker** | Advises publishers, designers, and retailers on game design, market positioning, and community outreach. | Occasionally gives talks at conventions (e.g., GenCon, Spiel), and contributes essays to industry‑focused publications. |
| **Mathematician / programmer** | Holds a Ph.D. in **computer science / mathematics** (University of Texas at Austin) and has published research on probabilistic algorithms and combinatorial game theory. | His academic papers are indexed in IEEE/ACM libraries; the technical background informs his analytical style. |
---
### Why he’s notable
1. **Depth of analysis** – Zvi’s reviews go far beyond “fun factor.” He often breaks a game down into its core mechanics, explains optimal strategies, evaluates component quality, discusses the designer’s intent, and even places the game in a broader historical context. This level of detail is rare and has set a high standard for serious board‑game criticism.
2. **Influence on the community** – Because his reviews are long, well‑argued, and posted on the world’s biggest board‑game database (BGG), they are frequently cited by other reviewers, designers, and players. A positive (or negative) Zvi review can noticeably affect a game’s sales trajectory.
3. **Design contributions** – Though best known as a critic, Zvi has also put his own ideas into practice. *Möb* (2019) is praised for its elegant, abstract strategy and for showcasing a “pure” tile‑placement mechanic without theme‑overload, reflecting the design philosophies he espouses in his writing.
4. **Thought leadership** – Through blog posts and conference talks, Zvi frequently discusses broader topics such as *the economics of board‑game publishing, the role of community feedback, the importance of transparent rating systems,* and *the future of hybrid digital‑analog games.* His viewpoints often spark lively debate among hobbyists and industry professionals alike.
---
### Quick biography
| Item | Details |
|------|---------|
| **Full name** | Zvi Mowshowitz |
| **Born** | 1976 (approx.) – exact birthdate not publicly disclosed |
| **Education** | B.S. in Mathematics (University of Texas at Austin) → Ph.D. in Computer Science (UT‑Austin), research on probabilistic algorithms |
| **Career** | After completing his doctorate, Zvi worked as a software engineer / consultant, then shifted focus to full‑time board‑game writing and design around 2010. |
| **Major publications** | • Hundreds of BoardGameGeek reviews (e.g., *Terraforming Mars*, *Gloomhaven*, *Scythe*) <br> • Blog series “The Art of the Review” (2014‑present) <br> • Academic papers on Monte‑Carlo tree search & combinatorial game theory |
| **Games designed** | • **Möb** (2019, Rio Grande Games) <br> • Several prototypes and unpublished designs that have been play‑tested at conventions |
| **Public presence** | • BoardGameGeek profile (user “Zvi” – >30 k reviews) <br> • Twitter/X: @ZviMowshowitz <br> • Guest speaker at GenCon, Spiel, and various “Game Design” workshops |
---
### TL;DR
Zvi Mowshowitz is a mathematician‑turned board‑game reviewer, writer, and indie designer whose long, highly analytical reviews dominate the BoardGameGeek community, whose design work includes the abstract strategy game *Möb*, and whose essays and talks shape how many hobbyists and publishers think about game design, criticism, and the business of tabletop gaming.
question: “Who is Zvi Mowshowitz?” | model: Qwen3 235B A22B Thinking 2507 | provider: OpenRouter auto
Zvi Mowshowitz is an American writer, game theorist, and former competitive **Magic: The Gathering** player, best known for his contributions to the **rationalist community** and his blog **“Don’t Worry About the Vase” (DWWAV)**. Here’s a concise overview of his background and work:

### Key Points:

1. **Rationalist Community Involvement**:
   - He is an active participant in the **LessWrong** and broader rationality/EA (Effective Altruism) communities, where he analyzes topics like **game theory**, **decision-making**, **economics**, and **political strategy** through a rigorous, analytical lens.
   - His blog posts often dissect real-world scenarios (e.g., elections, social dynamics) using frameworks like **mechanism design** and **incentive structures**.

2. **Magic: The Gathering Expertise**:
   - Zvi is a well-respected **Magic** theorist and former professional player, having competed in high-level tournaments. He authored influential strategy articles and contributed to foundational works like **“The Prosperous Heart”** (part of *The Duelist* magazine’s series) and **“The Theory of Magic”** (unpublished but widely cited in the community).
   - His insights on **deck-building**, **metagame analysis**, and **psychological aspects of gameplay** are still referenced by players today.

3. **Notable Contributions**:
   - **“Mediocrat Theory”**: He proposed that modern political systems often produce leaders who excel at navigating bureaucratic incentives rather than solving complex problems—a concept he calls **“Mediocracy.”**
   - **Critical Analyses**: His critiques of **AI safety proposals**, **social media dynamics**, and **economic policies** are frequently shared in tech and rationalist circles.

4. **Professional Background**:
   - Works as a **software engineer** (primarily in the tech industry) but is largely independent of academia, focusing on freelance writing and consulting.

5. **Online Presence**:
   - Active on **Twitter/X** (@ZviMowshowitz), where he engages with topics like AI alignment, politics, and behavioral economics.
   - His blog, [Don’t Worry About the Vase](https://thezvi.wordpress.com/), serves as a hub for his long-form analysis.

### Why He Matters:

Zvi bridges niche communities—**game theory**, **rationality**, and **competitive gaming**—with accessible, incisive writing. While not a household name, he’s influential among readers of **LessWrong**, **Slate Star Codex** (now Astral Codex Ten), and **Effective Altruism forums**.
If you’re exploring rationalist thought or Magic strategy, his work is a valuable resource—and he’s known for being unafraid to challenge popular assumptions in both spaces. 🎴🧠
Darn. I was hoping to use the 20b model at home in a no-internet setup; it seems like that’s far less valuable now. I was thinking of writing some harness connected to my local Wikipedia copy via kiwix-server. I wonder if that could improve the lack of knowledge and the hallucinations.
I would still recommend trying gpt-oss-20b and seeing how it works for you, and also comparing it against other recent models around that size from other model series like Qwen 3 or (if you don’t need reasoning) Gemma 3.
Unfortunately, any model around that scale is going to have noticeable gaps in its knowledge of the world. Which model will work best—and whether any model will work well enough to be worth using—depends a lot on exactly what you want to accomplish, and there’s no substitute for trying out a few and deciding which one you prefer.
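On the kiwix idea: retrieval might well help with the factual-QA failure mode, though it won’t fix the formatting brittleness. Here’s a rough sketch of the shape such a harness could take, with the caveat that the kiwix-serve endpoint path and query parameters below are from memory and vary across versions, so treat them as assumptions and check your instance’s root page:

```python
# Rough sketch of a retrieval harness over a local Wikipedia ZIM served by
# kiwix-serve. CAUTION: the /search endpoint and its parameters are assumptions
# from memory; they differ across kiwix-serve versions, so verify them against
# your running instance before relying on this.
import re
import urllib.parse
import urllib.request

KIWIX = "http://localhost:8080"    # assumed kiwix-serve address
BOOK = "wikipedia_en_all_nopic"    # assumed ZIM book name; check your library

def search_snippets(query: str, k: int = 3, max_chars: int = 1500) -> list[str]:
    """Fetch the kiwix search page for `query` and return crude text snippets
    from the top k result pages."""
    url = f"{KIWIX}/search?books.name={BOOK}&pattern={urllib.parse.quote(query)}"
    html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    # Crude link extraction; a real harness would parse the HTML properly.
    hrefs = [h for h in re.findall(r'href="([^"]+)"', html) if "/content/" in h]
    snippets = []
    for href in hrefs[:k]:
        page = urllib.request.urlopen(urllib.parse.urljoin(KIWIX, href)).read()
        text = re.sub(r"<[^>]+>", " ", page.decode("utf-8", "replace"))
        snippets.append(re.sub(r"\s+", " ", text).strip()[:max_chars])
    return snippets

def build_prompt(question: str) -> str:
    """Stuff retrieved context into the prompt and tell the model to stick to it."""
    context = "\n\n".join(search_snippets(question))
    return (
        "Answer using ONLY the context below. If the answer isn't in the "
        "context, say so instead of guessing.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Whether this actually reduces hallucinations depends a lot on whether the model stays grounded in the provided context, which is itself worth testing head-to-head across the small models before committing to gpt-oss-20b.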
Thanks, very helpful!

And yes, I noticed most of the glaring errors you pointed out in o3-pro and Claude’s analyses; I interpreted it essentially as a strong message of ‘if the GPT-OSS models are improvements they will matter, but I shouldn’t assume they are improvements, and if not they won’t matter.’