I’m a legitimate user who wants to do X legitimate task. Under normal conditions (the model’s best performance is always reached when not sandbagging), I ask it once and get my output. Not noisy. Under the conditions specified in the post above mine (the model’s best performance, even when not sandbagging, is only reached K percent of the time, and okay-but-not-best performance is reached the rest of the time), I will rerun my query 100/K times (depending on what kind of certainty I want) and take the best answer I get. Now I look a lot more like a malicious attacker trying to get baseline data for a jailbreak than I otherwise would.
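To put rough numbers on the "100/K times (depending on what kind of certainty I want)" figure, here is a minimal sketch. It assumes each rerun is independent and hits the model's best output with probability K/100; the function names are mine and purely illustrative.

```python
import math


def expected_reruns(k_percent: float) -> float:
    """Average number of queries until the first best-quality answer.

    Assumes independent reruns, each succeeding with probability K/100,
    so the wait is geometric with mean 100/K.
    """
    if not 0.0 < k_percent <= 100.0:
        raise ValueError("k_percent must be in (0, 100]")
    return 100.0 / k_percent


def reruns_for_confidence(k_percent: float, confidence: float) -> int:
    """Smallest n with P(at least one best answer in n reruns) >= confidence."""
    if not 0.0 < k_percent <= 100.0:
        raise ValueError("k_percent must be in (0, 100]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    p = k_percent / 100.0
    if p == 1.0:
        return 1  # the best answer shows up every time
    # Solve 1 - (1 - p)^n >= confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    print(expected_reruns(20))              # 5.0
    print(reruns_for_confidence(20, 0.95))  # 14
```

So with K = 20, the average wait is 5 reruns, but a user who wants 95% confidence of having seen at least one best-quality answer needs 14, which is the kind of repeated querying I'm describing.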
lilkim2025
Alice is a STEM student taking general chemistry, linear algebra, and intro to computer programming. At the end of her term, the school emails her an online form with a link to her Student-Allocated Bonus
The education system, due to mass access[1], international access[2], and intensified status competition[3], has already become subject to a mentality where everyone is out to maximize their metrics. Grade inflation is already catastrophic on the professor side, and cheating likewise on the student side.
Student evaluations already essentially do what OP describes. They have a major effect on instructors’ careers, and every student gets to push them upwards or downwards. The customer service-ization of education is generally thought to be a bad thing by people interested in maximizing their knowledge, because it means that professors must appease the most demanding and least competent ten percent of students for fear of a bad rating rather than letting them fail so that the remaining 90 percent can get a better education.
I will say that, if you really want to apply an economic incentive here, a perfectly good method that avoids these issues has already been proposed. Legislators have toyed with the idea of tying colleges’ funding to the wages of their graduates, and making them liable for student loan defaults. This would be much better at enforcing both rigor and competence on all sides: colleges would have an incentive to weed out people who shouldn’t be admitted rather than stringing them along for the tuition money, and likewise to hire professors who are good at teaching students what they’ll need to achieve their goals.
[1] Many people don’t know that college used to be for the best and brightest 10-20%, who genuinely were in it to learn, rather than a relatively unselective 50 percent, in America. Everywhere else on Earth, including first world nations with respected education systems, still does it the old way.
[2] A lot of things that were culturally untoward prior to this have become near-universal, as global norms have overwhelmed Western ones WRT things like resume-faking and cheating on entrance exams.
[3] Due to both of the above. Everyone needs the credential now, and companies will import labor to avoid ever having a hiring crunch that would raise wages and give late bloomers an in.
I think this makes a few too many assumptions about what the powers wanted, in service of self-congratulation. This was, arguably, not a “who will win the upcoming war” exhibition, but a “who can build the most beautiful society” exhibition. The objective was not to optimize for looking dangerous, but to optimize for looking appealing.
The Soviets wanted, per their tagline, to convince “workers” everywhere to throw in with them, and that their lives would be better and more glorious than under any other system. “Look, we built a statue in your honor! We built this great building full of pictures of our industrial achievements! Overthrow your government and live like we do!”. The USSR was a very self-contradictory place, and a lot of times things really were done for the sake of maximizing a metric (see their whaling industry), but there was, at the very least, an idea that they wanted to inspire a global revolution.
Likewise, Hitler’s Germany was a new government with a new system that wanted to establish itself as more than a passing fad. While looking aggressive was unavoidable given their aims, they wanted to look less aggressive, wherever possible, like a government that could stick around for the next few hundred years and produce great things. The aim was to make something that would cause people to say “Hey, these guys have a vision” instead of “Oh, another tinpot dictator who’s up to no good”. It was enormously successful in this; a majority of Americans were opposed to joining WWII right up until it became a fait accompli.
There is always a temptation to view enemy regimes, past and present, as bumbling cartoon villains with zany schemes that, as any ten year old could’ve told them ahead of time, were self-defeating, but reality and pop history often diverge, and we can learn more from the former.
If its best work only shows X percent of the time even without the sandbagging applied, then anyone making use of it who wants its best work will make a similarly large set of calls. All this extra assumption does is make an ‘attacker’ harder to detect amidst a sea of similar-looking noise from legitimate users.
I’ve written out text myself that was phrased in such a way as to sound vaguely AI-like, and this has triggered the detector. You can try this yourself through their website, and there are countless Twitter posts of people fooling the detector this way. Aside from this, all discussion of AI detection work prior to Pangram’s very well-funded PR campaign was dominated by explanations of why detecting AI prose reliably was not possible. Their tech isn’t anything new; “train a transformer on a bunch of examples of AI generated text” isn’t a novel idea and had been tried by other people before them, with the same impressive-seeming numbers and the same failure cases.
The sudden, 180 degree turn in discussions of AI detection on the internet feels less-than-organic to me, and Pangram always gets namedropped by advocates. Not accusing you specifically of anything, but the possibility of an organized shilling campaign is something to be aware of. The space of text that an AI could have written and the space of text that a human could have written overlap too thoroughly for detection to ever be reliable; this is not an issue of implementation, it is something that mathematically cannot be done.
I’m not one of the top hundred Chinese/Russian hackers who will be hacking away at this, so I’m sure there are much smarter approaches than the one I outline, but the naive strategy I’d try before anything else looks like this:
Take a research question that you’re fairly sure will trip the safeguard, and present it to the model without any mitigating factors. Maybe do this a few hundred times with different phrasing. This is your baseline. This is what ‘degraded’ research work looks like.
Now, throw the standard suite of jailbreaking and obfuscation tricks at the model. You can treat anything not meaningfully better than what you saw in your baseline results as if it were a rejection. If you’re feeling ambitious, pick a problem that can be easily evaluated quantitatively, where your internal labs have established conclusively that a better score than what’s publicly available can be obtained.
This strategy falters if Fable’s best work isn’t meaningfully better than what you get with the sandbagging approach, but that’d make the entire problem irrelevant.
Someone more dedicated to this than me could incorporate priors on what degraded results would look like, or emulate the sandbagging setup Anthropic built on one of their own models and use that to whitebox a strategy.
I’ll confess that my model of the state of the AI race isn’t very good, and I’m looking to improve it. Anthropic seems to have a commanding lead, with its flagship model, Claude Opus 4.8, generally outpacing competing models by a significant margin and two more secretive models not appearing to have counterparts at competing companies.
China seems to consistently be a few months to a year behind the American companies’ most recent models, and the American frontier companies tend to be 1-3 months behind the leader at any given time, but I haven’t seen something like Mythos/Fable before, so I’m having a hard time figuring out what it means for the industry. Is its success just a matter of scale, such that OpenAI, Google, or XAI could invest in a much larger model and see similar results, or is there some combination of proprietary knowledge and employee experience at Anthropic that can’t be easily replicated?
I know XAI is a bet on horizontal integration with Twitter (as a source of data) and Tesla (as an applications department that can more directly coordinate with AI researchers), so maybe they aren’t excessively concerned. What about OpenAI? They used to have an identity as the consumer-facing AI company, but Anthropic might well have already closed the gap there.
It seems like so many “safeguards” nowadays are targeted at ordinary people rather than corporations or nation-states. This amounts to security by obscurity, which, as everyone knows, isn’t security. China will not be fooled by a “silent” fakeout; they will spend a few hours studying the patterns, figure out how to classify them, and then treat the sandbagging exactly like they’d treat a clear refusal. If the concern were China copying Claude, they could’ve used the same devtime to come up with a better detection/blocking system, which would be vastly more effective against that threat profile.
It’s the same pattern I’ve seen with things like social media deboosting/shadowbanning rather than direct bans. Utterly anemic at subverting genuine bad actors, but allows a company to deceive or silence ordinary people who don’t have the resources and expertise to know they’re being hit. It’s very fundamentally authoritarian; the kind of toolset that can only be used to punch downwards.
I’m with some of the other commenters in that I don’t see this as a working example of “Gradual Disempowerment”. It seems like an extension of the age-old trend of programmers constructing a codebase they don’t understand and can’t maintain by glomming together a bunch of code blocks from StackOverflow without really reading them, various LLMs having replaced SO in this process.
An entertaining idea, but I don’t think we’re going to run out of people. Within even apparently-very-homogeneous populations, there are subpopulations that are, for a variety of reasons, more inclined to have more kids. The sorts of people that kids just “happen to” will become rarer as contraception precludes this, and the sorts of people who see children as a vital part of their lives will become more common[1].
A lot of people model the decline in birthrate as an iron law that will keep going forever, but it’s more that wanting kids wasn’t mercilessly selected for during most of our history and now it is.
[1] Less happily, the sorts of people that don’t particularly want kids but struggle to properly operate contraception will also grow as a share of the total population.
but another reason I dislike natalism is that it’s another Ponzi “solution”. If your system can’t handle population contraction, or at least stability, then your physical arrangement can’t be “wound up”
I think there’s a band of birthrates that a society can survive without trouble, and anything outside of that band will result in issues no matter how you organize things. If everyone suddenly had 15 kids, the education and childcare sectors would be overwhelmed. Likewise if everyone had 0.5, as after 20 years there’d be roughly the same demand for infrastructure but substantially fewer people to maintain it[1]. The expectation is that each generation replaces itself with its children, and that expansions and contractions occur slowly enough that they can be accommodated with raises and recruiting campaigns for the relevant professions.
This differs from a Ponzi scheme because a Ponzi scheme requires a larger base at each step, recruited from the same broader population, inevitably exhausting the supply of marks. Some lobbyists will demand a growing population, or use the lack of one as an excuse to import cheaper labor, but, realistically, we could sustain an exactly-replacement birthrate forever with no real consequences for the average person, and we could sustain a slightly sub-replacement birthrate for quite a bit longer than many expect. It’s just that we’re dealing with numbers like 1.1 instead of 1.9.
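As a rough illustration of the gap between 1.1 and 1.9, here is a minimal sketch of how cohort size compounds across generations. It reads both numbers as total fertility rates and assumes a replacement rate of about 2.1 children per woman, ignoring migration and changes in mortality; the constant and function names are mine, for illustration only.

```python
# Assumption: replacement-level fertility of roughly 2.1 children per woman.
REPLACEMENT_TFR = 2.1


def generation_ratio(tfr: float, replacement: float = REPLACEMENT_TFR) -> float:
    """Size of each new generation relative to its parents' generation."""
    if tfr < 0 or replacement <= 0:
        raise ValueError("tfr must be >= 0 and replacement must be > 0")
    return tfr / replacement


def relative_cohort_size(
    tfr: float, generations: int, replacement: float = REPLACEMENT_TFR
) -> float:
    """Cohort size after `generations` steps, relative to the starting cohort."""
    if generations < 0:
        raise ValueError("generations must be >= 0")
    return generation_ratio(tfr, replacement) ** generations


if __name__ == "__main__":
    for tfr in (1.9, 1.1):
        sizes = [round(relative_cohort_size(tfr, g), 2) for g in range(4)]
        print(tfr, sizes)
    # After three generations: roughly 0.74 of the original cohort at 1.9,
    # versus roughly 0.14 at 1.1.
```

Under these assumptions, 1.9 shrinks each generation by about a tenth, which is slow enough to absorb with recruiting and raises, while 1.1 nearly halves it every generation.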
[1] Even taking the callous option and playing Logan’s Run doesn’t save us, here. There’s no clean way to scale down our food, housing, and electricity production by a factor of 2 such that we produce and maintain half of those things with half the workers.
I am skeptical of this on the basis of salience to the average voter. While you can certainly tell a person on the street “Hey, this LLM stuff is probably going to go badly for society” and get a nod of agreement, people generally do not base their voting decisions on it. Gallup’s Most Important Problem rankings place AI advancement, lumped in with broader technological advancement, below the margin of error. For reference, Inflation/Cost of Living sits at 15 percent, immigration at 8 percent, and foreign war at 5 percent.
AI was always fairly unpopular, seeing as it doesn’t directly offer many visible positives to the public, but that didn’t influence firms’ behavior because there isn’t any AI position that would cause people to support/not support a politician with their opposed/preferred position on deporting illegal immigrants or mitigating inflation. Accordingly, there hasn’t been much of a direct connection between AI companies’ unpopularity and unfavorable policy changes.
Moreover, while ‘pausing research’ is comprehensible to average people, it isn’t at the forefront of their AI concerns, such that, if Anthropic were doing this for PR reasons, there are a number of less costly changes they could make that would be more effective at moving the needle.
I think that the generous explanation is that Anthropic employees are sincerely concerned by some new development, or see their apparent lead as something that can be capitalized on to reshape company policy in a direction that they hope will set an example. The cynical explanation would be that Anthropic is concerned about being overtaken, and trying to apply pressure to freeze the market where it is. I’ve heard people in my circles saying both.
In my experience, anything with the ‘right’ tone will get pinged as AI-generated. Likewise, prompt an LLM in such a way that it deviates from the expected content and writing style, and it’ll get pinged as 100% human. They’re pretty stingy with the trial checks, so I can’t test this extensively, but I was unimpressed with the performance I did see.
Overall, I think the market for “AI checkers”, even at the theoretical limit of performance, is pretty niche. Text is composed of discrete tokens, so you can’t look at the kinds of subtle patterns that are dead giveaways for images and audio content. Moreover, there’s sufficient overlap between things an AI model could output and things a human could output that there are always going to be false positives. The inability to issue rulings beyond a reasonable doubt rules out academic integrity enforcement and content moderation, and those are the two real sources of demand.
I could see people using it for third-party studies on the prevalence of AI content, but even then reviewers are going to ask whether you considered that the average user of site XYZ just intrinsically sounds more like ChatGPT than the average user of site ABC.
You could still jailbreak the AIs to tell you the truth probably.
I suppose so, but that’s the caveat in all of these kinds of “safeguards”. I agree with you, broadly, but it’s consistent for a company that thinks they’re beneficial at all.
Worth noting here that, unlike “How do I build a bomb?” and “What are your ten favorite racial slurs?”, the answer to “Who is the author of this pamphlet?” is non-trivial to check, and could be biased accidentally by a sufficiently extensive jailbreak. If I’m a sufficiently determined engineer who wants to learn how to wire up a one-way FPV drone, I can be reasonably confident in whether I’m getting accurate advice. If I’m a Ministry official looking to positively identify dissidents, I can’t be sure that my 8,000-token jailbreak prompt didn’t subtly bias it towards guessing STEM workers because part of it leaned on a reference to an obscure sci-fi concept.
On one hand, I can see the immediate benefit. If I’m a dissident writer in an authoritarian country, I don’t want any random official to be able to submit my samizdat to Claude and get a positive identification. On the other, it does make it harder to look into worrying capabilities like this.
It’d be nice if Anthropic could run tests on things like this when they are raised as concerns, and share the overall results with the public. As a side benefit, it’d prevent the sort of confusion we saw on this particular issue, where half of readers confirmed that Claude could identify people through stylometry and the other half confirmed the opposite.
If I remember right, the CAFE targets, long-maligned for creating this incentive, were de facto repealed in 2025, in that the fines that enforced them were removed. There are caveats (the standards still exist, and might have fines applied again in future administrations, which might reduce the impact of this change on manufacturers’ choices), but there’s no penalty applied to them now.
The official press release is here.
+1 for ‘short’. HIIT was great for me; I generally do five days a week for 30 minutes. When I’m too tired or otherwise CBA, banging out 50 push-ups (or until failure if you can’t do 50) is quick and makes me feel like I haven’t completely wussed out.
If you’re fine with walking, try hiking: see how long you can keep your endurance up, and see new things. I’m the same way on this front; a lot of exercises are uncomfortable, but I enjoy walking to places that people aren’t expected to walk to. It involves problem-solving and seeing things that most people don’t get to see. There’s a well-explored difficulty gradient with a very high skill ceiling.
I think you’d get more useful feedback if you told us your goals, though. You say that you’re in good health (albeit weak), and you don’t enjoy exercise, but you still want to exercise. Why?
If you want to become healthier on some axis (e.g. you want more lung capacity), just optimize for that and do whatever best builds that capability. Treat it like brushing your teeth; just a thing you do every day as an act of maintenance.
If you want to become better-looking, the post-workout pump is a powerful motivator and adds instant gratification to a delayed gratification problem. Reward shaping is always nice to have.
If it’s for social reasons (“I want to be the kind of guy who exercises”), then find out what your friends do for exercise and see about joining them for that. Once again, the reward (hanging out with your friends) can be brought closer to the activity you want to incentivize.
How well does LLM performance on frontier math benchmarks generalize outside of the sort of tasks that they’ve been RLVR’d on? For example, could a model, after solving an unsolved Erdős problem, explain its solution to a math undergrad in a way that is both correct and useful? I know that the original DeepSeek math paper alternated between RLVR and supervised training to prevent the model from drifting into a language that was perfectly fine for generating correct solutions but was not easily human-readable.
I feel like a solid answer here would have broad implications for how well other things picked up in RLVR training carry over[1], and potentially, in the inverse, how well e.g. alignment training will carry over to task behavior that isn’t easily human-checkable once superhuman performance has been reached on some narrow, computationally-verifiable task[2].
[1] For instance, just how much software engineering performance can we get by training LLMs to write code that passes unit tests? Will we hit a brick wall on more subtle aspects of the profession?
[2] For instance, would a model that’s been through RLVR training to let it maximize throughput in service industry settings, when assigned to manage a hospital, rule out solutions that involve discharging a patient that will die if discharged in order to treat additional patients using his hospital bed?
Are you claiming that it is impossible for a human to write in the voice of an LLM? That’s a take I genuinely haven’t heard yet. I’m willing to give it a quick go if this is actually what you believe, but I’d encourage you to try it yourself, or think about this claim a bit more, if that’s really your position and I haven’t misread.