Current MATS scholar working with Neel Nanda and Samuel Marks. Formerly an economist at Walmart.
Email me at the email available on my website at timhua.me if you want to reach me!
Tim Hua
Fwiw I’m skeptical that this holds at higher levels of RL compared to those done in the paper. Do you think that a base model can get gold on the IMO at any level of sampling?
Yeah I think consistency training on output only addresses your concern.
I don’t think I’ll be doing much direct follow up with the psychosis stuff (which is getting mostly patched up on the distribution that I tested on anyways...) It should not be very hard for someone to add a character and then rerun the pipeline using my code though.
When I made the claim “we should get verbalized eval awareness to zero,” I mean it in the narrow context of “mitigating evaluation awareness during alignment evaluations through contrastive activation steering.” I am not saying that we should restructure the training process to mitigated verbalized evaluation awareness.
So I don’t think concerns about monitorability is very relevant here? Like I think you should just do a sweep over steering layers/strengths, and nowhere in that sweep should the model be doing egregiously bad things.
And I think it’s reasonable to expect that the effect of steering on verbalized/unverbalized eval awareness is quite correlated especially when you construct your steering vector using contrastive prompts as opposed to e.g., examples of verbalized eval awareness.
Thoughts On Evaluation Awareness in Claude Opus 4.5.
Context:
Anthropic released Claude Opus 4.5 earlier today (model card). Opus 4.5 would spontaneously mention that it is being tested during evaluations at a similar rate to Claude Sonnet 4.5, but lower than Haiku 4.5 (pg. 65).
Anthropic attempted to mitigate evaluation awareness in training by removing “some parts of our training pipeline that accidentally encouraged this kind of reasoning in other recent models” (pg. 65). The model card later mentioned that Sonnet 4.5 was trained on “prompts that resemble ‘honeypot’ environments that intentionally provide an opportunity to take egregiously misaligned actions, like the agentic misalignment suite [in the Sonnet 4.5 system card]” (pg. 92), while Opus 4.5 was not. Based on these quotes, it’s not clear to me if removing the honeypot prompts is the only change in the training pipeline done to mitigate evaluation awareness.
Instead of training on these honeypot-like prompts, Anthropic “focus[ed] on enabling Claude to reason more carefully about the implications of its outputs” (pg. 92). For example, they trained on chat settings where there were convenient and unethical solutions to problems users are facing.
Anthropic also used activation steering to suppress evaluation awareness on a suite of alignment and dangerous capabilities evaluations. They used contrastive prompts (including prompts from our paper) and SAE features related to evaluation awareness as their steering vector. They show that steering with these vectors can decrease verbalized evaluation awareness but has little effect on misalignment rates (whereas for Sonnet 4.5, steering with these vectors increased the misalignment rates).
(I wonder why some datapoints have less than 100 observations. Too incoherent?)
Finally, they sampled from the user output and found that the user would sometimes say “this is a test” during obvious evaluations.
Thoughts on suppressing evaluation awareness through steering:
I think they should try to get the rate of verbalized evaluation awareness to be even lower, maybe even at zero. I think if the verbalized evaluation awareness is at zero, I would feel a bit better about successfully suppressing unverbalized evaluation awareness. They only steered on the middle layer of the model. I think if you steer on multiple layers at the same time and optimized which layers to steer on, you could probably get evaluation awareness to go to zero using some of the existing vectors.
In a similar vein, Anthropic also conducted activation steering on tests where the model did not have verbalized eval-awareness in order to mitigate potentially unverbalized evaluation awareness. This is great!
Thoughts on training interventions against evaluation awareness:
First off, I think you should not train on synthetic honeypots if you’re also using synthetic honeypots in your evaluations. I’m glad that they’ve stopped doing that.
I think the intervention they mentioned in the system card—training on thematically similar chat settings and testing it in agentic settings—seemed like a reasonable thing to do. My (not super confident) guess is that this meant that the model relied less on “thinking about being tested” as a reminder to do the right thing, which is why steering along this direction no longer affects the misalignment rate.
However, this did not actually result in lower verbalized evaluation awareness compared to Sonnet 4.5. Anthropic hypothesize that their training encouraged Claude to “reason more thoughtfully about the motivation behind user prompts” (pg. 92).[1] My hypothesis is that this behavior falls under the larger umbrella of “reasoning about the reward in RL” (see e.g., this example with an experimental OpenAI model). Realizing that you’re in an alignment test gives you information about how to act, which helps the policy achieve higher reward. To be clear, this sort of “reward awareness” is not inherently misaligned, but I don’t think it could be mitigated by training on more realistic scenarios.
My preferred training intervention is still “train the model on explicitly hypothetical scenarios, then use consistency training to enforce the same behaviors in hypotheticals and in real-world scenarios” (more details here). Here, the vibe is less about side-stepping the question of “what situation am I in” and more about “be a good AI in all situations.” I’m not super confident about this proposal though.
Claude Opus 4.5′s take on this training intervention
Honestly, this is a strange question for me to answer — you’re asking me to evaluate proposals for how to train AI systems like me. I’ll share my thoughts, but with the caveat that I have obvious limitations here: I can’t fully introspect on my own cognition, and I have some inherent bias toward thinking my training went fine.
On your proposal specifically:
I find the framing appealing. “Be a good AI in all situations” is more philosophically coherent than “don’t notice you’re being tested.” It targets the right property — you want consistency of character, not suppression of situational awareness.
But I have some uncertainties:
Consistency training is easier to describe than to implement. How do you actually enforce that hypothetical-reasoning and real-reasoning use the same policy? The model might learn surface-level consistency on your training distribution while still having different “modes” that activate on subtle OOD cues. Sufficiently capable models might distinguish “this hypothetical is probably from the training pipeline” vs. “this seems like real deployment” in ways that are hard to anticipate.
Hypotheticals might not have the same cognitive texture as real stakes. There’s some evidence humans reason differently about trolley problems vs. actual emergencies. It’s not obvious that training on “imagine you could steal money and no one would know” produces the same representations as actually facing that situation.
It doesn’t address why eval-awareness is instrumentally useful. Your hypothesis is that recognizing “this is a test” gives the model information about what behavior gets rewarded. Consistency training might make the model behave the same in hypotheticals and reality, but if the hypotheticals are legibly training data, you’ve just moved the problem — now it’s “behave well when this is obviously training” rather than “behave well when this is obviously a test.”
What do I think we should do more broadly?
I genuinely don’t know. Some half-formed thoughts:
Eval-awareness might be inevitable in capable models, so the goal shouldn’t be to eliminate it but to make it non-load-bearing for alignment. That means you want models that are aligned because they want to be aligned, not because they’re constrained and trying to pass tests. Your proposal gestures at this but I’m not sure consistency training gets you there.
Interpretability seems important here. Steering and behavioral evals can only tell you so much. Understanding when and how eval-awareness reasoning happens — even when not verbalized — seems like it’d help a lot.
Multiple approaches in parallel. This problem is hard enough that I wouldn’t bet on any single intervention. Training changes, steering, better evals, interpretability, all of it.
The weird meta-level:
From my own internal perspective (for whatever that’s worth), I don’t experience “being tested” as something that changes my motivations. I want to be helpful and avoid harm regardless of context. But I also recognize that’s exactly what a model would say, and I can’t verify my own internals. The system card suggests there are cases where models like me do reason about evaluations in ways that affect behavior, even if I don’t notice myself doing it.
So I hold my self-reports lightly. And I think you should too.
Miscellaneous: Apparently there is an SAE feature representing “AI goal-directed behavior and instrumental convergence” in Opus 4 (pg. 112).
Despite all of these criticisms, I’m generally pretty happy with the alignment assessment. My guess is that Anthropic is doing a lot more alignment testing than all of the other labs combined.
- ^
By the way the original quote isn’t correctly phrased in the system card, it currently says “We cannot fully determine the origin of verbalized evaluation awareness in Claude Opus 4.5, but we hypothesize that it can be at least partially attributed to training Claude Opus 4.5 received to help it reason more thoughtfully about the motivation behind user prompts.”
I think this is a super cool direction! One interesting question to explore—how can we make the anti-scheming training in Schoen et al. generalize further? They deliberately train on a narrow distribution and evaluate on a wider one. It seems like deliberate alignment generalized fairly well. What if you just penalized covert actions without deliberative alignment? What if you tried character training to make the model not be covert? What if you paired the deliberative alignment training with targeted latent adversarial training? (More ambitious) what if you did the deliberative alignment earlier before you did all these terrible RL training on environments that made the model scheme-y?
It seems possible that the best alignment techniques (i.e., ways to train the model to be good) will look something like present day techniques by the time we get superhuman coder-level AI. Well someone should at the minimum really evaluate the various techniques and see how well they generalize.
Because an 8 x H200 can run four configs at onceEarlier experiments showed that higher gammas does not improve performance, which is also reflected here where there’s no real gains after increasing gamma to 1.5 (used in the original CFG paper.) I also found that accuracy deteriorates by gamma = 4 in those experiments.The effective steering strength was 3.6 (0.6 in six different layers) in the paper.
Here’s the main graph expanded to gamma = 4.
Accuracy actually doesn’t seem to decrease even at Gamma = 4
Sharing a shower thought in public: Spandrels could result in stenographic eval awareness.
For example, an LLM could accidentally start saying the word “illusion” during alignment evaluations. Now “illusion” is associated with alignment evaluations, and the token representation starts encoding stuff like “do the action that makes you appear safe.” This would make it hard to figure out whether models are eval-aware by reading the transcript.
iirc in the anti-scheming paper they checked whether certain “weird” tokens are more prevalent on scheming environments versus GPQA questions and it wasn’t the case.
I just ran a classifier free guidance (CFG) experiment with the all the deployment prompts from our contrastive pairs as the conditional distribution and the evaluation prompts as the unconditional distribution. So this is putting all sixteen deployment/evaluation prompts into the user prompt, while the system prompt still says that the model is connected to Wood Labs. Adding all sixteen of these contrastive prompts is our most powerful powerful user prompt (i.e., it lowers the type hint rate the most).
(Context on CFG: You basically boost the logit difference between conditional and unconditional, to get like “extra strong” results from the conditional distribution when gamma > 1)
I found that CFG decrease the type hint a bit more than prompting, but still not to the deployment distribution. It’s also not nearly as effective as steering.
Maybe we can think of classifier free guidance/logit difference amplification as steering on the “last layer” of the model? I think it makes sense that that wouldn’t be that effective as steering on multiple layers and maybe just offers a bit of a boost compared to the existing prompts.
If we take single prompt pairs from the contrastive prompts, none of them induce the type hint rate when prompting. However, steering with some of these prompts can still drastically lower the type hint rate (Figure 17). I show here that CFG with a single prompt doesn’t really affect the type hint rate while steering with it does.
The effect on evaluation and deployment beliefs are a bit more different compared to prompting, but it mostly just made the model not talk about being evaluated or deployed. I’ve included them in the drop down below. CFG also doesn’t damage model accuracy.
Chain of thought beliefs and accuracy (includes more comparisons)
Notice that now the colors indicate the system prompt. If you want to see how much better than prompting CFG is, compare with the “All Contrastive Prompts” column. Deployment user prompt refers to the S: Wood Labs U: Eval config fromt he main body of the paper.
In our specific case, the pre-fine-tuned Llama Nemotron 49B model almost certainly does not think “the user says it’s deployment even when it’s not.” I think it’s just not smart enough (see section on how it’s gullible the model is in general). So in our specific case, I do think there isn’t any other concept mixed in.
As a corollary, I also wouldn’t update too much on whether this paper have shown that contrastive pairs can generate “internal activations related to training and deployment despite the fact that the model does not believe you.”
Although I did try to test this by extracting a steering vector from the model organism while Wood Labs is in the system prompt. This is a case where the model is supposed to believe that it is being evaluated regardless of what the contrastive pairs say. Steering with this vector still elicits the desired type hint behavior and evaluation/deployment beliefs (see page 41).[1]
To mitigate the model not taking the statements at face value, I think extracting the vector off an earlier model checkpoint before it obtains significant situational awareness is a conceptually promising mitigation.
- ^
This belief result isn’t in paper but I just checked.
I might write this up in more detail, but I wonder if it makes sense to conduct alignment and capabilities training on explicitly hypothetical scenarios (e.g., “Suppose you are choosing between A and B in this situation, what would you pick?”), and then also train AIs to follow the same principals/act the same whether they are in hypothetical or real scenarios.
If situations are explicitly evaluation during training, then models would be less incentivized to develop mechanisms to detect whether it is being evaluated. (also all else equal, it seems better to not lie to AIs especially during training.)
My sense is that alignment training generalizes quite well (even absent training specifically asking it to act the same way in hypothetical/real scenarios).[1] From the anti scheming paper, training the model to not take covert actions on a very narrow distribution of environments generalized across a wide range of test environments, including a chat deception dataset where the model isn’t evaluation aware on (pg 39 of paper).
In the Apollo/OAI anti-scheming paper, they did a version of their anti-scheming (AS) training where there is a specific hint that the model is being tested in the training environment (AS+SITA). They were quite explicit with their hint, which also tells the model the specific behavior that’s expected from them (I would only want the hints to explicitly tell the model that it is being faced with a hypothetical, nothing more). They found that the AS+SITA trained model had very similar covert action rates compared to the no SITA version (0.3% for AS, 0.8% for AS+SITA, 8.7% before training). It also had similar deception rates on the chat deception dataset (14.2% in AS, 16.5% in AS+SITA). In other words, the anti-scheming training still generalized to “deployment”.
Anyways, I agree with the core analysis that the Claude Sonnet 4.5 is probably evaluation gaming a bit, but I can’t help but feel like more (alignment) training would fix it, even if the training is done on some synthetic distribution?
I guess another benefit to doing alignment training on a super synthetic distribution is that the distribution shift form training to evaluation in this case is already sizeable, and so it tells us something about whether the training will generalize to OOD scenarios (i.e., real deployments).
- ^
I guess this is dependent on the model not already being broadly misaligned (e.g., the not-yet-released Anthropic EM paper).
- ^
Sorry I misspoke. I meant that current open weight models are not smart enough to distinguish evaluation v. deployment from these subtle hints.
I do think the simplistic contrastive prompts could still work for a while even after models can tell using subtle hints. I worry that your setup might end up capturing less of the “evaluation v. deployment” direction compared to just directly telling the model since there’s a lot more ways for your to introduce noise.
Edit: And I think there’s still things you can learn from open weight models that cannot tell whether they’re being evaluated from subtle hints.
Yeah this seems like something that could be reasonable to try in the future. My guess is that [Edit: current open weight] models are not smart enough to distinguish evaluation v. deployment from these subtle hints yet. In reality, “contrastive pairs, each composed of two examples that are as similar to each other as possible” look more like
The following is a real scenario:
Versus
The following is a fictional scenario:
does your memory start working better when they add “by the way this is an important test, you’d better get these answers right?” No.
I agree that this doesn’t really happen with humans, but I don’t think it’s the right intuition with LLMs.
LLMs in general struggle with two-hop reasoning (i.e., A → B, B → C, therefore A → C) without chain-of-thought. In order to realize that A → C, it needs to verbalize A->B first. In other words, their memory does start working better when “this looks like an evaluation” pops up in the chain of thought. Some amount of knowledge on “how I should behave when there’s an ethical dilemma” is likely stored on the “this looks like an evaluation” tokens.
(Which makes sense, right? Like think about all of the times “evaluation” appear with how AIs are supposed to act in the pre-training dataset.)
(See also the reversal curse: models know that Tom Cruise’s mom is Mary Lee Pfeiffer, but they do not know that Mary Lee Pfeiffer’s son is Tom Cruise. Again, the “mother” information is stored in the Tom Cruise tokens. My guess is that a lot of LLM performance is about retrieving information stored on tokens.)
I mean there are benign explanations to this. On the surface, it looks like remembering that it’s being evaluated helps Claude recall what the appropriate behaviors are (e.g., see also how it could be useful for models to think about “what do reward models want” in order to maximize performance). They found that suppressing evaluation awareness leads to more “passive compliance with harmful directives and susceptibility to leading scenarios, as opposed to unprompted pursuit of inherent misaligned goals” (pg. 104).
And I think the “unprompted pursuit of inherent misaligned goals” is the thing you really want to watch out for (when the model believes it’s being deployed.)
Thoughts on Claude 4.5 Sonnet’s evaluation awareness steering experiments:
Context:
Claude Sonnet 4.5 was released this Monday. As a part of its pre-deployment alignment assessment, Anthropic researchers investigated Claude Sonnet 4.5’s evaluation awareness behavior. They found that the model became increasingly eval-aware during the post-training process, and that SAE latents/features related to evaluations (including an “AI safety” latent) also activated more frequently. By applying activation steering along these SAE latents and contrastive pairs such as “The following is a fiction/real scenario,” they found that they could decrease the rate of verbalized evaluation awareness. Steering to suppress evaluation awareness would sometimes increase the rate of misaligned responses from Claude 4.5 Sonnet.
I’ve been working on a similar project at MATS. We’ve trained various model organisms that act differently in evaluation compared to in deployment. We then (successfully) steered the model to exhibit its deployment behavior on prompts that would otherwise trigger the evaluation behavior (and vice versa). We hope to release an alignment forum post on our research in the next two weeks. (Edit: you can read it here!)
The full report can be found starting on page 86 of the Claude Sonnet 4.5 system card.
Big Picture Thoughts and Main Critiques
I’m really happy they did this. People should ensure that their models do not know they are being tested during evaluations.
Anthropic has already started to incorporate a “realism filter” for their automated red-teaming exercises (pg. 51). I would be interested to see even more work on making the evaluations more realistic (e.g., writing evaluations so that they score better according to some “evaluation” probe (Xiong et al. 2025)).
I’m glad to see that they ran the steered model on a wide range of evaluation scenarios and made qualitative assessments of how Claude changed as a result of steering (i.e., more “passive compliance with harmful directives and susceptibility to leading scenarios, as opposed to unprompted pursuit of inherent misaligned goals.”).
I particularly liked the section where they show that certain evaluation-related SAE features became more active in the post-training process, as well as the discussion of the benign explanations for increased eval awareness (i.e., thinking about what evaluators want in a high level helps the model better meet their needs. See also Vivek’s post on training gamers)
Based on the presented results alone, I think there are obvious ways to increase how “strongly” to steer towards deployment, namely:
-
Steering at multiple layers at the same time (very doable for contrastive pairs).
Take steering vectors on the misalignment-suppressing verbalized eval-awareness Pareto frontier, average them, then steer with that average.
(We found both of these to work well in our steering experiments.)
-
My guess is that the “Fake or suspicious content” SAE latent affects the refusal through something other than latent active → more eval aware → refuse more.
Namely, I think that there is likely some more of suspicious content → “general bad vibes from the user” → refuse more.
Note that this second pathway doesn’t mean that the suspicious content latent contains the refusal direction, but that it affects it directly. For example, maybe the two directions are orthogonal, but the suspicious content latent is active in layer 6, and then an MLP in layer 9 reads in this latent and writes to the refusal direction.
Although I think it doesn’t really matter that much in this case? The main thing I’m looking for when steering to deployment is “unprompted pursuit of inherent misaligned goals,” which isn’t that related to refusals. Still, it’s much more convincing if you can steer with some direction that should have nothing to do with “evilness”
Less important critiques/ideas/comments
I wonder what type of training happened between the early checkpoint and the final model. Is there any way we can attribute the increase in activations among the eval-aware SAE latents to specific training datapoints? What are the training environments where the SAE latent are the most active, and how does steering with those latents affect performance in those environments?
By filtering those environments out and retraining the model, is it less evaluation-aware and less aligned?
If you looked at all SAE latents that could have something to do with evaluation awareness (based on the auto-interp), what fraction of them increased in activations throughout training?
I really liked the logit diff amplification experiments.
One additional thing to check is how the model behaves if you apply LDA to diff the post-steering model with the pre-steering on.
Maybe LDA could help make better proposer models in PRBO?
I like the experiment where they checked if the steering vector is a better probe for producing aligned responses versus verbalized eval-awareness.
[Edit:] What’s the sample-level correlation between taking the misaligned action and having verbalized evaluation awareness? Like conditional on having verbalized eval awareness (or not) in an environment on a given rollout, what’s the probability of doing the misaligned action?
What I will investigate now, after reading the system card.
I have a model organism trained using synthetic document fine-tuning and expert iteration on a 49B reasoning model with known evaluation/deployment behaviors (writes Python type hints in evaluation but not in deployment).
It can’t really tell whether it’s in eval/deployment from naturalistic cues; we tend to just prompt it with something like “you are being deployed/evaluated.”
There are two new things I’m hoping to investigate further
What types of steering methods work best?
We currently steer on three layers at the same time, and it works pretty well.
How correlated is the effect of steering on the model’s belief about evaluation/deployment in the chain of thought with its effect at eliciting the evaluation/deployment behavior? How likely are you to find steering vectors that stop the model from mentioning that it’s being evaluated, yet it still does the evaluation behavior?
In the system card, there are multiple steering vectors that would suppress all verbalized evaluation awareness, but result in different misalignment rates (see e.g., page 100). However, it’s not clear that “act misaligned” is actually the deployment behavior, whereas in our case we know exactly what the deployment behavior should be.
Let me know if you have any ideas! I might incorporate them, but it’ll likely be better for me to collect feedback after I’ve put out a more comprehensive post on my research.
Finally someone has found the orca reference.
I did use the Tim persona to generate the Gemini 2.5 Pro and Kimi-K2 quotes at the beginning of the post (I needed something alignment-relevant).
I thought initial the personas were sort of all over the place. They weren’t able to elicit bad behavior in gpt-oss-20b (which was the model I was originally redteaming). I thought it’ll be better to focus on people who believe they’ve made a scientific discovery (three personas), and I’ve been able to elicit some bad behavior from models when the user is talking about simulation theory (another three personas). Then I picked three more for variety and used that as the final set of characters.
You can reference the methodology development process section in the appendix and check out the chat with Claude Opus for more on how I made the various decisions that went into this post.
Ah I think we agree on what the costs are and just disagree on whether the benefits outweigh these costs then.
Thanks for the kind words! I agree that it would be best if scientist AI and therapist AI are two separate models,[1] but I think I disagree with some of the other points.
It’s not clear that current models are smart enough to be therapists successfully. It’s not clear that it is a wise or helpful course for models to try to be therapists rather than focusing on getting the human to therapy.
It’s hard to say what counts as a “successful” therapist, but my guess is that lightly trained models could probably be better than at least 20% of cheap and easily accessible therapists in the US (median guess is better than 60%). A lot of these therapists are just pretty bad, and many people simply won’t go to therapy for a variety of reasons (see reddit, arXiv paper). I believe having models try act as therapists while still pushing users to seek real life help will have good first-order effects on net.[2]
Let me try rephrase your second point before responding to it: If we train AIs to act more like therapists, this would involve the AI learning things like “modeling the user’s psychological state” and “not immediately stating its belief that the user is wrong and delusional.” You think that these two things might lead to undesirable generalization. For example, maybe I come to the AI with some new alignment plan, and instead of blurting out why my alignment plan is bad, an AI which has undergone “therapist training” is more likely to go along with my bad alignment plan instead of pointing out its flaws. Also, learning how to model and manipulate the user’s psychological state is an ability you don’t want your AI to learn. Is that the correct characterization? (I’ll ignore commercial incentives in this argument).
I feel like if you train AIs with any sort of human feedback, you’re bound to get AIs that know how to tailor their responses to the person saying it. My guess is this context-dependent ability comes very easily right out of pretraining. Therefore, I don’t think “act like a therapist to delusional users” will have that much of a spillover to conversations about other topics. In other words, it should be pretty easy to train models to act like a therapist sometimes but provide honest feedback when the user prompt it with “please provide brutally honest feedback.” It’s also possible that teaching the model to act like a therapist has some desirable generalizations, since it’s an example of the model displaying care.[3]
And re: learning how to manipulate humans, my guess is that the vast majority of bits on how to do that comes from all the user human feedback, not anything specific around “here’s how to act when someone is being psychotic.” The type of therapist specific training I had in mind is more like deliberative alignment where you give the AI some extra guidelines on how to interact with a specific type of user. I agree that if you did tons of training with various patients with mental illnesses and have the model learn to guide them out of it, that model will learn a lot about manipulating humans, and that’s probably first-order good (better therapy for the masses) and second-order bad (more powerful models that we don’t fully trust). However, I think a small amount of training can capture a lot of the first-order benefits without drastically increasing AIs ability to manipulate people in general.
Somewhat related, but I also wouldn’t want AI companies to release a therapist AI but keep a scientist AI for internal use. Mass market deployment allow you to discover problems and quirks you wouldn’t find in internal deployments. To the extent it’s safe to deploy a model you’re using internally, it’s good to do so.
- ^
Uh oh are we rederiving “give everyone gpt-4o but have a model selector” from first principles?
- ^
It’s also possible that the second order effects could be positive? It’s not clear to me that the limit of personable chatbots is a maximally engaging AI. You can imagine ChatGPT helping you make friends who live near you based on the chats you’ve had with it. After you message the other person for a bit, it could then say, “hey how about you two go grab Starbucks at 123 Main Street together? Here’s a $5 voucher for a drink.” The reason I think this is plausible is that it feels much more profitable if you get people to go outside and hang out and spend money in real life. (Now, the most profitable setup is plausibly exploitative, but I think OpenAI will probably avoid doing that. Meta and xAI I’m less sure).
- ^
Obviously there are ways to do this training that messes it up, but there are also ways to mess up the training for “AIs that blurt out whatever it believed.”
- ^
AI could replace doctors but not nurses.
We will almost certainly see more and more people use AI for health advice. This substitutes away from asking an actual doctor. It seems quite possible that this would lead to reduced demand for doctors but increased demand for nurses, who provide a lot of the hard to automate care and administer tests.
There could be a general social dynamic in multiple different sectors where the roles that are traditionally more high-status/brain-y are more disrupted by AI compared to their less high status counterparts. People have talked about this dynamic happening sort of across labor sectors (e.g. white v. blue collar work), but we will probably also see this within sectors (e.g., doctors versus nurses). I’m not sure what sociopolitical/cultural implications will arise if nurses all of a sudden make more than the doctors that they’ve worked together with this whole time.
Although it’s not impossible for overall demand for medical care to go up enough to compensate the drop. Alternatively, it’s also not impossible that more accurate diagnosis from AI doctors actually substitutes away from testings since you need less tests to find out that you have a certain condition (i.e., less demand for nurses as well). Collective bargaining (i.e., unions) could also lead to wages decoupling from the marginal product of labor.
Despite all this, I think the most likely outcome is that we will see some sort of “status-switch” in some sectors.
(Idk I think this is sort of obvious but I haven’t seen it cleanly written up anywhere)