That’s not necessarily true. For example, it might allow them to save face by ousting Anthropic and making an example of them while not losing all AI capabilities.
One example immediately comes to mind: Eliezer Yudkowsky, the epitome of an old-school AI safety person, is highly confident that animals have no moral patienthood. (Which isn’t the same as the claim you made, but it’s strongly related.)
Yep I had Eliezer and Nate Soares in mind when I wrote the footnote “Some people don’t think nonhuman animals are sentient beings, but I feel relatively confident they’re applying a standard Peter Singer would approve of as morally consistent.”
Note that Eliezer has written a relatively thoughtful justification of his views on theory of mind and why he thinks various farmed animals aren’t moral patients. He also says:
If there were no health reason to eat cows I would not eat them, and in the limit of unlimited funding I would try to cryopreserve chimpanzees once I’d gotten to the humans. In my actual situation, given that diet is a huge difficulty to me with already-conflicting optimization constraints, given that I don’t believe in the alleged dietary science claiming that I suffer zero disadvantage from eliminating meat, and given that society lets me get away with it, I am doing the utilitarian thing to maximize the welfare of much larger future galaxies, and spending all my worry on other things. If I could actually do things all my own way and indulge my aesthetic preferences to the fullest, I wouldn’t eat *any* other life form, plant or animal, and I wouldn’t enslave all those mitochondria.
I disagree with Eliezer, but I think he’s thinking far more carefully about animal welfare than the vast majority of the population.
I hardly ever see AI safety people grapple with what AI alignment means for non-human welfare
I think the majority of (the very modest amount of) progress in thinking on this topic has come more from AI safety folks than animal welfare folks. Can you link me good writing on this question from animal welfare folks?
Does focusing on animal welfare make sense if you’re AI-pilled?
It’s useful for evals to be run reliably for every model and maintained for long periods. A lot of the point of safety-relevant evals is to be a building block people can use for other things: they can make forecasts/bets about what models will score on the eval or what will happen if a certain score is reached, they can make commitments about what to do if a model achieves a certain score, they can make legislation that applies only to models with specific scores, and they can advise the world to look to these scores to understand if risk is high.
Much of that falls apart if there’s FUD about whether a given eval will still exist and be run on the relevant models in a year’s time.
This didn’t use to be an issue because evals were simple to run: just a script asking a model a series of multiple-choice questions.
Agentic evals are complex. They require GPUs and containers and scripts that need to be maintained. You need to scaffold your agent and run it for days. Sometimes you need to build a vending machine.
I’m worried about a pattern where a shiny new eval is developed, run for a few months, then discarded in favor of newer, better evals. Or where the folks running the evals don’t get around to running them reliably for every model.
As a concrete example, the 2025 AI Forecasting Survey asked people to forecast what the best model’s score on RE-Bench would be by the end of 2025, but RE-Bench hasn’t been run on Claude Opus 4.5, or on many other recent models (METR focuses on their newer, larger time-horizon eval instead). It also asked for forecasted scores on OSWorld, but OSWorld isn’t run anymore (it’s been replaced by OSWorld-Verified).
There are real costs to running these evals, and when they’re deprecated, it’s usually because they’re replaced with something better. But I think sometimes people act like this is a completely costless action and I want to point out the costs.
Some of that error is correlated between models; they also have versions of the graph with error bars on the trendline and those error bars are notably smaller.
The error bars are also much smaller when you look at the plot on a log-y-axis. Like, in some sense not being able to distinguish a 10-minute time horizon from a 30-minute one is a lot of error, but it’s still very distinct from the one-minute time horizon of the previous generation or the 2-hour time horizon you might expect from the next generation. In other words, when you look at the image you shared, the error bars on o4 mini don’t look so bad, but if you were only looking at models up to o4 mini you’d have zoomed in a bunch and the error bars on o4 mini would be large too.
Also note that to cut the size of the error bars in half you’d need ~4x as many tasks, and to cut them by 4x you’d need ~16x as many. And you’d need to be very confident the tasks weren’t buggy, so just throwing money at the problem and hiring lots of people won’t work: you’ll just get a bunch of tasks you won’t have confidence in.
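To see why the task count bites so hard, here’s a minimal simulation sketch (the per-task success rate and task counts are made up; this isn’t METR’s actual methodology): the standard error of a score estimated from independent tasks shrinks roughly like 1/√(number of tasks), so halving it takes ~4x the tasks.

```python
# Toy illustration: the standard error of an eval score estimated from n
# independent tasks shrinks like 1/sqrt(n), so halving it requires ~4x tasks.
# The per-task success probability and task counts here are made up.
import numpy as np

rng = np.random.default_rng(0)
p_success = 0.4  # hypothetical per-task success rate

for n_tasks in [100, 400, 1600]:
    # Simulate many re-runs of an n_tasks-task eval and look at the spread.
    scores = rng.binomial(n_tasks, p_success, size=10_000) / n_tasks
    print(f"{n_tasks:5d} tasks -> std of eval score ~ {scores.std():.3f}")
# Each 4x increase in the number of tasks roughly halves the standard error.
```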
Keep in mind the opportunity cost is real, though, and the main blocker for orgs like METR is usually talent/capacity rather than money. It would be great if they had capacity for this, and you’re right that it is insane that humanity doesn’t have better benchmarks. But there are at least a dozen other fires that large that METR seems to be trying to address, like RCTs to see if AI is actually speeding people up and risk report reviews to see if AIs are actually safe. Perhaps you think these are less important, but if so I would like to hear that argument.
All that said, my understanding is METR is working on this. I would also love to see this type of work from others!
I think that there are still very real trade-offs. Examples:
Should you wear sunscreen?
Should you smoke?
Should you decrease sodium intake so that you don’t develop hypertension?
And for many things where there is some short-term cost and some long-term longevity cost, the long-term cost might be large enough to change the calculus.
Many people who think ASI will be developed soon seem to assume this means they should care less about their long-term health because in most worlds it won’t matter: they figure most likely by the time they get old they’ll either be dead or humanity will have cured aging and disease. I think it’s important to remember that the bigger update is probably on the size of the value at stake, not the probability of health interventions mattering.
Even if ASI seems like it will happen soon, I think there’s a real (if small) chance that humanity develops radical life-extension technology but not for another 50-100 years: maybe there’s an AI winter, maybe medical research ends up being inherently slow (either for legal reasons or because it requires trials in humans, and those require the humans to actually age over time), maybe humanity decides to pause and not build ASI, maybe humanity decides to have a long reflection before building any crazy technology that cures death, etc.
The upside of hanging on to life until radical life-extension technology is developed seems extremely high: there are ten trillion years or so before the stars start to burn out (and you could probably live after the stars burn out, plus, you could run a simulation of yourself that lets you live a subjectively longer time). Even if you think there are steep diminishing returns to how long you live, getting to live into the depths of the far future probably gives you more control over how the future looks. If resources are divided equally amongst currently-living humans, you should eventually expect to get your own galaxy or two, but you’d need to live long enough for that space exploration and apportionment to be sorted out.[1] Even if that assumption is too rosy, every galaxy has a trillion or so planets. Probably someone will throw you one out of charity, and those odds go up if you live a long time.
The upside is so large that even though you might think the probability of this outcome is slim compared to worlds where humanity either goes extinct or quickly cures aging and disease, it still seems overwhelmingly important to shoot for.[2]
If you buy this worldview, you probably want to focus on preserving your mind: avoiding risk factors for dementia/strokes, avoiding concussions/head trauma, avoiding literally dying.
- ^
Maybe you could try to wield similar influence through a will, but this means (1) you don’t get to experience the benefits firsthand, (2) the will might not specify what you want well enough, (3) you might not get as many resources; people don’t tend to pay as much heed to the wishes of dead people.
- ^
Unless it’s trading off with other goals on a similar scale, such as if you are an altruist trying to make the future better.
End-of-year donation taxes 101
When I was first trying to learn ML for AI safety research, people told me to learn linear algebra. And today lots of people I talk to who are trying to learn ML[1] seem under the impression they need to master linear algebra before they start fiddling with transformers. I find in practice I almost never use 90% of the linear algebra I’ve learned. I use other kinds of math much more, and overall being good at empiricism and implementation seems more valuable than knowing most math beyond the level of AP calculus.
The one part of linear algebra you do absolutely need is a really, really good intuition for what a dot product is, the fact that you can do them in batches, and the fact that matrix multiplication is associative. Someone smart who can’t so much as multiply matrices can learn the basics in an hour or two with a good tutor (I’ve taken people through it in that amount of time). The introductory linear algebra courses I’ve seen[2] wouldn’t drill this intuition nearly as well as the tutor even if you took them.
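To make that concrete, here’s a minimal numpy sketch of those three facts (not any particular model’s code; the vectors are random placeholders):

```python
# The three load-bearing facts: (1) a dot product measures similarity between
# two vectors, (2) a matrix multiply is just many dot products done in a batch,
# (3) matrix multiplication is associative, so you can regroup chains of matmuls.
import numpy as np

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))   # 4 vectors of dimension 8
keys = rng.normal(size=(5, 8))      # 5 vectors of dimension 8

# (1) one dot product: similarity between the first query and the first key
single = queries[0] @ keys[0]

# (2) all 4x5 dot products at once, as a single matrix multiply
batched = queries @ keys.T
assert np.isclose(batched[0, 0], single)

# (3) associativity: (A @ B) @ C == A @ (B @ C), which is why you can fold
# chains of linear maps together in whatever grouping is cheapest.
A, B, C = rng.normal(size=(3, 4)), rng.normal(size=(4, 5)), rng.normal(size=(5, 6))
assert np.allclose((A @ B) @ C, A @ (B @ C))
```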
In my experience it’s not that useful to have good intuitions for things like eigenvectors/eigenvalues or determinants (unless you’re doing something like SLT). Understanding bases and change-of-basis is somewhat useful for improving your intuitions, and especially useful for some kinds of interp, I guess? Matrix decompositions are useful if you want to improve cuBLAS. Sparsity sometimes comes up, especially in interp (it’s also a very very simple concept).
The same goes for much of vector calculus. (You need to know you can take your derivatives in batches and that this means you write your d/dx as ∂/∂x or ∇, an upside-down triangle. You don’t need curl or divergence.)
I find it’s pretty easy to pick things like this up on the fly if you ever happen to need them.
Inasmuch as I do use math, I find I most often use basic statistics (so I can understand my empirical results!), basic probability theory (variance, expectations, estimators), having good intuitions for high-dimensional probability (which is the only part of math that seems underrated for ML), basic calculus (the chain rule), basic information theory (“what is KL-divergence?”), arithmetic, a bunch of random tidbits like “the log derivative trick”, and the ability to look at equations with lots of symbols and digest them.
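As one example of what I mean by “basic information theory”, here’s a minimal sketch of KL-divergence between two discrete distributions (the distributions themselves are made up):

```python
# KL divergence between two discrete distributions: the expected extra
# log-loss from modeling data drawn from p using q instead of p.
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])  # e.g. a base model's next-token distribution
q = np.array([0.5, 0.3, 0.2])  # e.g. a fine-tuned model's distribution
print(kl_divergence(p, p))  # 0.0: no penalty for using the true distribution
print(kl_divergence(p, q))  # > 0, and asymmetric: not equal to kl_divergence(q, p)
```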
In general most work and innovation[3] in machine learning these days (and in many domains of AI safety[4]) isn’t based on formal mathematical theory; it’s based on empiricism, fussing with lots of GPUs, and stacking small optimizations. As such, being good at math doesn’t seem that useful for doing most ML research. There are notable exceptions: some people do theory-based research. But outside these niches, being good at implementation and empiricism seems much more important; inasmuch as math gives you better intuitions in ML, I think reading more empirical papers or running more experiments or just talking to different models will give you far better intuitions per hour.
- ^
By “ML” I mean things involving modern foundation models, especially transformer-based LLMs.
- ^
It’s pretty plausible to me that I’ve only been exposed to particularly mediocre math courses. My sample-size is small, and it seems like course quality and content varies a lot.
- ^
Please don’t do capabilities mindlessly.
- ^
The standard counterargument here is that these parts of AI safety are ignoring what’s actually hard about ML and that empiricism won’t work: for example, that we need to develop techniques that work on the first model we build that can self-improve. I don’t want to get into that debate.
The other day I was speaking to one of the most productive people I’d ever met.[1] He was one of the top people in a very competitive field who was currently single-handedly performing the work of a team of brilliant programmers. He needed to find a spot to do some work, so I offered to help him find a desk with a monitor. But he said he generally liked working from his laptop on a couch, and he felt he was “only 10% slower” without a monitor anyway.
I was aghast. I’d been trying to optimize my productivity for years. A 10% productivity boost was a lot! Those things compound! How was this man, one of the most productive people I’d ever met, shrugging it off like it was nothing?
I think this nonchalant attitude towards productivity is fairly common in top researchers (though perhaps less so in top executives?). I have no idea why some people are so much more productive than others. It surprises me that so much variance is even possible.
This guy was smart, but I know plenty of people as smart as him who are far less productive. He was hardworking, but not insanely so. He wasn’t aggressively optimizing his productivity.[2] He wasn’t that old so it couldn’t just be experience. Probably part of it was luck, but he had enough different claims to fame that that couldn’t be the whole picture.
If I had to chalk it up to something, I guess I’d call it skill and “research taste”: he had a great ability to identify promising research directions and follow them (and he could just execute end-to-end on his ideas without getting lost or daunted, but I know how to train that).
I want to learn this skill, but I have no idea how to do it and I’m still not totally sure it’s real. Conducting research obviously helps, but that takes time and is clearly not sufficient. Maybe I should talk to a bunch of researchers and try to predict the results of their work?
Has anyone reading this ever successfully cultivated an uncanny ability to identify great research directions? How did you do it? What sub-skills does it require?
Am I missing some other secret sauce that lets some people produce wildly more valuable research than others?
- ^
Measured by more conventional means, not by positive impact on the long-term future; that’s dominated by other people. Making sure your work truly steers at solving the world’s biggest problems still seems like the best way to increase the value you produce, if you’re into that sort of thing. But I think this person’s abilities would multiply/complement any benefits from steering towards the most impactful problems.
- ^
Or maybe he was, but there are so many 2x boosts that the 10% ones aren’t worth worrying about?
Fair enough. This doesn’t seem central to my point so I don’t really want to go down a rabbit-hole here. As I said originally, “I’m picking this example not because it’s the best analysis of its kind, but because it’s the sort of analysis I think people should be doing all the time and should be practiced at, and I think it’s very reasonable to produce things of this quality fairly regularly.” I know this particular analysis surfaced some useful considerations others hadn’t thought of, and I learned things from reading it.
I also suspect you dislike the original analysis for reasons that stem from deep-seated worldview disagreements with Eric, not because the methodology is flawed.
The advice and techniques from the rationality community seem to work well at avoiding a specific type of high-level mistake: they help you notice weird ideas that might otherwise get dismissed and take them seriously. Things like AI being on a trajectory to automate all intellectual labor and perhaps take over the world, animal suffering, longevity, cryonics. The list goes on.
This is a very valuable skill and causes people to do things like pivot their careers to areas that are ten times better. But once you’ve had your ~3-5 revelations, I think the value of these techniques can diminish a lot.[1]
Yet a lot of the rationality community’s techniques and culture seem oriented around this one idea, even on small scales: people pride themselves on being relentlessly truth-seeking and willing to consider possibilities they flinch away from.
On the margin, I think the rationality community should put more emphasis on skills like:
Performing simple cost-effectiveness estimates accurately
I think very few people in the community could put together an analysis like this one from Eric Neyman on the value of a particular donation opportunity (see the section “Comparison to non-AI safety opportunities”). I’m picking this example not because it’s the best analysis of its kind, but because it’s the sort of analysis I think people should be doing all the time and should be practiced at, and I think it’s very reasonable to produce things of this quality fairly regularly.
When people do practice this kind of analysis, I notice they focus on Fermi estimates where they get good at making extremely simple models and memorizing various numbers. (My friend’s Anki deck includes things like the density of typical continental crust, the dimensions of a city block next to his office, the glide ratio of a hang glider, the amount of time since the last glacial maximum, and the fraction of babies in the US that are twins).
I think being able to produce specific models over the course of a few hours (where you can look up the glide ratio of a hang glider if you need it) is more neglected but very useful (when it really counts, you can toss the back of the napkin and use a whiteboard).
Simply noticing something might be a big deal is only the first step! You need to decide if it’s worth taking action (how big a deal is it exactly?) and what action to take (what are the costs and benefits of each option?). Sometimes it’s obvious, but often it isn’t, and these analyses are the best way I know of to improve at this, other than “have good judgement magically” or “gain life experience”.
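To illustrate the kind of artifact I mean, here is a toy sketch with entirely made-up numbers (not a real analysis of any actual opportunity); the point is just to write the assumptions down so they can be argued with:

```python
# A toy cost-effectiveness estimate of a hypothetical donation opportunity.
# Every number below is invented purely to show the shape of the exercise.
donation_usd = 100_000                 # hypothetical donation size
p_project_succeeds = 0.3               # chance the funded project works at all
value_if_success_usd = 2_000_000       # rough value of the outcome if it works
counterfactual_funding_share = 0.5     # how much of the outcome this money actually buys

expected_value = p_project_succeeds * value_if_success_usd * counterfactual_funding_share
roi = expected_value / donation_usd
print(f"Expected value: ${expected_value:,.0f} (ROI ~ {roi:.1f}x)")
# Next step: vary each assumption over its plausible range and see which ones
# actually drive the conclusion.
```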
Articulating all the assumptions underlying an argument
A lot of the reasoning I see on LessWrong feels “hand-wavy”: it makes many assumptions that it doesn’t spell out. That kind of reasoning can be valuable: often good arguments start as hazy intuitions. Plus many good ideas are never written up at all and I don’t want to make the standards impenetrably high. But I wish people recognized this shortcoming and tried to remedy it more often.
By “articulating assumptions” I mean outlining the core dynamics at play that seem important, the ways you think these dynamics work, and the many other complexities you’re ignoring in your simple model. I don’t mean trying to compress a bunch of Bayesian beliefs into propositional logic.
Contact with reality
It’s really really powerful to look at things directly (read data, talk to users, etc), design and run experiments, and do things in the world to gain experience.
Everyone already knows this, empiricism is literally a virtue of rationality. But I don’t see people employing it as much as they should be. If you’re worried about AI risk, talk to the models! Read raw transcripts!
Scholarship
Another virtue of rationality. It’s in the Sequences, just not as present in the culture as you might expect. Almost nobody I know reads enough. I started a journal club at my company, and after nearly every meeting folks tell me how useful it is. I often see work that would be much better if the authors engaged with the literature a little more. Of course YMMV depending on the field you’re in; some literature isn’t worth engaging with.
Being overall skilled and knowledgeable and able to execute on things in the real world
Maybe this doesn’t count as a rationality skill per se, but I think the meta-skill of sitting down and learning stuff and getting good at it is important. In practice the average person reading this shortform would probably be more effective if they spent their energy developing whatever specific concrete skills and knowledge were most blocking them.
This list is far from complete.[2] I just wanted to gesture at the general dynamic.
- ^
They’re still useful. I could rattle off a half-dozen times this mindset let me notice something the people around me were missing and spring into action.
- ^
I especially think there’s some skill that separates people with great research taste from people with poor research taste that might be crucial, but I don’t really know what it is well enough to capture it here.
Sorry, this is what I meant; you’re right.
I think the canary string is slightly more likely to work than the robots.txt. And the things you link don’t say which labs. I totally believe some labs scrape aggressively and ignore all robots.txt, but it wouldn’t surprise me if others don’t! In my opinion, if even one frontier lab doesn’t train on this data, that’s still quite valuable. And it seems so cheap to just try this as a bit of a Hail Mary.
Absolutely! But having models that don’t know how to effectively scheme without access to external resources is still very useful. As just one example, we could interrogate the models and give them honeypots without giving them access to the internet.
Labs could also ban some of their models from reading certain webpages or ingesting certain information in deployment (as in, filtering all inputs to the model at inference time), especially for models performing high-stakes tasks where sabotage would be easier.
I also suspect that digesting a LessWrong post the model reads in deployment is more likely to involve explicit reasoning about the post in the CoT (or probes, etc.) and thus to be caught. By contrast, if a model has seen lots of transcripts of misalignment in pre-training, the misaligned actions might be intuitive and second-nature, and thus harder to detect.
Probably I should have included a footnote about this. I’m well aware that this is not a foolproof mechanism, but it still seems better than nothing and I think it’s very easy to have a disclaimer that makes this clear. As I said in the post, I think that people should only do this for information they would have posted on LessWrong anyway.
I disagree that these things are basically ignored by labs. My guess is many labs put some effort into filtering out data with the canary string, but that this is slightly harder than you might think and so they end up messing it up sometimes. (They might also sometimes ignore it on purpose, I’m not sure.)
Even if labs ignore the canary string now having the canary string in there would make it much easier to filter these things out if labs ever wanted to do that in the future.
I also suggest using better methods like captchas for non-logged-in users. I expect something like this to work somewhat well (though it still wouldn’t be foolproof).
LessWrong feature request: make it easy for authors to opt-out of having their posts in the training data.
If most smart people were put in the position of a misaligned AI and tried to take over the world, I think they’d be caught and fail.[1] If I were a misaligned AI, I think I’d have a much better shot at succeeding, largely because I’ve read lots of text about how people evaluate and monitor models, strategies schemers can use to undermine evals and take malicious actions without being detected, and creative paths to taking over the world as an AI.
A lot of that information is from LessWrong.[2] It’s unfortunate that this information will probably wind up in the pre-training corpus of new models (though sharing most of it is often still worth it overall[3]).
LessWrong could easily change this for specific posts! They could add something to their robots.txt to ask crawlers looking to scrape training data to ignore the pages. They could add canary strings to the page invisibly. (They could even go a step further and add something like copyrighted song lyrics to the page invisibly.) If they really wanted, they could put the content of a post behind a captcha for users who aren’t logged in. This system wouldn’t be perfect (edit: please don’t rely on these methods. They’re harm-reduction for information where you otherwise would have posted without any protections), but I think even reducing the odds or the quantity of this data in the pre-training corpus could help.
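For a sense of how cheap these mitigations are, here’s a minimal sketch (not LessWrong’s actual implementation; GPTBot and CCBot are real crawler user-agents, but the “/opted-out-posts/” path and the canary value are placeholders, and the real BIG-bench canary is a specific published GUID):

```python
# Minimal sketch of the two cheap mitigations described above.
# The path and canary value are placeholders; the set of training-data
# crawlers also changes over time, so robots.txt entries would need upkeep.

# robots.txt entries asking known training-data crawlers to skip opted-out posts.
ROBOTS_TXT_SNIPPET = """\
User-agent: GPTBot
Disallow: /opted-out-posts/

User-agent: CCBot
Disallow: /opted-out-posts/
"""

CANARY_STRING = "EXAMPLE-CANARY-DO-NOT-TRAIN-00000000"  # placeholder, not the real GUID


def add_invisible_canary(post_html: str) -> str:
    """Append the canary in a span hidden from readers but visible to scrapers."""
    return post_html + f'<span style="display:none">{CANARY_STRING}</span>'
```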
I would love to have this as a feature at the bottom of drafts. I imagine a box I could tick in the editor that would enable this feature (and maybe let me decide if I want the captcha part or not). Ideally the LessWrong team could prompt an LLM to read users’ posts before they hit publish, and if it seems like a post might be something the user wouldn’t want models trained on, the site could proactively ask the user whether they want it kept out of the training corpus.
As far as I know, no other social media platform has an easy way to try to keep your posts out of the training corpus (and many actively sell their data for this purpose). So LessWrong would be providing a valuable service.
The actual decisions around what should or shouldn’t be part of the pre-training corpus seem nuanced: if we want to use LLMs to help with AI safety, it might help if those LLMs have some information about AI safety in their pre-training corpus (though adding that information back in during post-training might work almost as well). But I want to at least give users the option to opt out of the current default.
- ^
That’s not to say all misaligned AIs would fail; I think there will be a period where AIs are roughly as smart as me and thus could bide their time and hide their misalignment without being caught if they’d read LessWrong, but might fail and get caught if they hadn’t. But you can imagine we’re purchasing dignity points or micro-dooms depending on your worldview. In either case I think this intervention is relatively cheap and worthwhile.
- ^
Of course much of it is reproduced outside LessWrong as well. But I think (1) so much of it is still on LessWrong and nowhere else that it’s worth it, and (2) the more times this information appears in the pre-training data, the more likely the model is to memorize it or have the information be salient to it.
- ^
And the information for which the costs of sharing it aren’t worth it probably still shouldn’t be posted even if the proposal I outline here is implemented, since there’s still a good chance it might leak out.
Interesting! How did Norquist/Americans for Tax Reform get so much influence? They seem to spend even less money than Intuit on lobbying, but maybe I’m not looking at the right sources or they have influence via means other than money?
I’m also somewhat skeptical of the claims. The agreement between the IRS and the Free File Alliance feels too favorable to the Free File Alliance for them to have had no hand in it.
As to your confusion, I can see why an advocacy group that wants to lower taxes might want the process of filing taxes to be painful. I’m just speculating, but I bet the fact that taxes are annoying to file and require you to directly confront the sizable sum you may owe the government makes people favor lower taxes and simpler tax codes.
Ways training incentivizes and disincentivizes introspection in LLMs.
Recent work has shown some LLMs have some ability to introspect. Many people were surprised to learn LLMs had this capability at all. But I found the results somewhat surprising for another reason: models are trained to mimic text, both in pre-training and fine-tuning. Almost every time a model is prompted in training to generate text related to introspection, the answer it’s trained to give is whatever answer the LLMs in the training corpus would say, not what the model being trained actually observes from its own introspection. So I worry that even if models could introspect, they might learn to never introspect in response to prompting.
We do see models act consistently with this hypothesis sometimes: if you ask a model how many tokens it sees in a sentence, or instruct it to write a sentence with a specific number of tokens, it won’t answer correctly.[1] But the model probably “knows” how many tokens there are; it’s an extremely salient property of the input, and the space of possible tokens is a very useful thing for a model to know since it determines what it can output. At the very least, models can be trained to semi-accurately count tokens and conform their outputs to short token limits.
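(If you want to check the ground truth yourself, a tokenizer gives it directly; here is a minimal sketch using OpenAI’s tiktoken library, though other model families use different tokenizers, so counts vary:)

```python
# Count the tokens in a sentence directly with a tokenizer, to compare against
# what a model claims when you ask it to introspect.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one of OpenAI's encodings
sentence = "How many tokens are in this sentence?"
tokens = enc.encode(sentence)
print(len(tokens))          # ground-truth count for this tokenizer
print(enc.decode(tokens))   # round-trips back to the original sentence
```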
I presume the main reason models answer questions about themselves correctly at all is because AI developers very deliberately train them to do so. I bet that training doesn’t directly involve introspection/strongly noting the relationship between the model’s internal activations and the wider world.
So what could be going on? Maybe the way models learn to answer any questions about themselves generalizes? Or maybe introspection is specifically useful for answering those questions and instead of memorizing some facts about themselves, models learn to introspect (this could especially explain why they can articulate what they’ve been trained to do via self-awareness alone).
But I think the most likely dynamic is that in RL settings[2] introspection that affects the model’s output is sometimes useful. Thus it is reinforced. For example, if you ask a reasoning model a question that’s too hard for it to know the answer to, it could introspect to realize it doesn’t know the answer (which might be more efficient than simply memorizing every question it does or doesn’t know the answer to). Then it could articulate in the CoT that it doesn’t know the answer, which would help it avoid hallucinating and ultimately produce the best output it could given the constraints.
One other possibility is the models are just that smart/self-aware and aligned towards being honest and helpful. They might have an extremely nuanced world-model, and since they’re trained to honestly answer questions,[3] they could just put the pieces together and introspect (possibly in a hack-y or shallow way).
Overall these dynamics make introspection a very thorny thing to study. I worry it could go undetected in some models or it could seem like a model can introspect in a meaningful way when it only has shallow abilities reinforced directly by processes like the above (for example knowing when they don’t know something [because that might have been learned during training], but not knowing in general how to query their internal knowledge on topics in other related ways).
- ^
At least, not on any model I tried. They occasionally get it right by chance; they give plausible answers, just not precisely correct ones.
- ^
Technically this could apply to fine-tuning settings too, for example if the model uses a CoT to improve its final answers enough to justify the CoT not being maximally likely tokens.
- ^
In theory at least. In reality I think this training does occur but I don’t know how well it can pinpoint honesty vs several things that are correlated with it (and for things like self-awareness those subtle correlates with truth in training data seem particularly pernicious).
At the risk of stating the obvious, I really like that this illustrates how ridiculous many AI evaluations are. If you read this and it sounds ridiculous, that is how I feel when reading many evals papers.
I think the analogy between reading a CEO’s Signal messages and a model’s CoT is apt. Sometimes I hear people say things like “I heard the CoT doesn’t fully spell out all of a model’s reasoning, AKA it’s unfaithful, so it’s not important to preserve and study to learn if the model is misbehaving.” To me that sounds as broken as saying “In a court case a CEO’s Signal chats don’t tell us all his reasoning, so they are unfaithful and therefore not important to preserve and study to learn if he is misbehaving.”
Similarly, I think the ham-fisted questions or crazy and obviously fake evaluation environments are only a shade more unrealistic than even the best real-world alignment evaluation environments [cf. Agentic Misalignment] (though I am cautiously optimistic about Petri and OAI’s new approach).
In general people’s thinking about how to evaluate if AIs will misbehave seems much more confused than their thinking about how to evaluate if people will misbehave. I worry a lot of the jargon gets in the way of common sense or makes very simple ideas seem inaccessible.
I’m reminded of the Wason selection task: when you ask people the classic, abstract version (four cards show A, K, 4, and 7; the rule is “if a card has a vowel on one side, it has an even number on the other”; which cards must you turn over to check the rule?), it’s hard and they often get it wrong.
But when you frame the exact same problem as a social situation people can picture, the problem becomes very easy. Suppose the rule at a bar is that you need to be over 21 to drink alcohol. You’re the sheriff, and you walk into the bar and see four people. For two of them you don’t know what they’re drinking: one is your neighbor, who you know is 17, and the other you can tell at a glance is well over 40. For the other two you can’t make out their age, but one is clearly drinking beer and the other is clearly just drinking water. Who do you need to talk to to see if anyone is in violation of the law?
From this perspective it’s obvious you talk to the kid where you don’t know what he’s drinking and the guy drinking a beer where you don’t know his age. And similarly, I think when you think about things from the right perspective, it feels “obvious” the researchers should have shared raw transcripts and that the Signal messages are informative even if they’re unfaithful, and that these types of setups are almost nonsensically fabricated and thus hard to draw confident conclusions from.
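(Here’s a minimal sketch of the same check in code, with hypothetical people, just to make the “which cases could falsify the rule” logic explicit:)

```python
# The Wason check in code: to test "if drinking alcohol then age >= 21" you only
# need the cases that could falsify it: possible drinkers whose age is unknown
# or underage, and known-underage people whose drink is unknown.
people = [
    {"name": "neighbor",  "age": 17,   "drink": None},     # drink unknown
    {"name": "older guy", "age": 45,   "drink": None},     # drink unknown
    {"name": "beer guy",  "age": None, "drink": "beer"},   # age unknown
    {"name": "water guy", "age": None, "drink": "water"},  # age unknown
]

def must_check(p):
    could_be_drinking = p["drink"] is None or p["drink"] == "beer"
    could_be_underage = p["age"] is None or p["age"] < 21
    # Only someone who might be both drinking and underage can violate the rule.
    return could_be_drinking and could_be_underage

print([p["name"] for p in people if must_check(p)])  # ['neighbor', 'beer guy']
```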