I haven’t read this post yet but am a fan of this sort of work. I also like Neel Nanda’s youtube videos where he records himself reading a paper. For me, the video recording format is even better because I can see what you do in real time and IMO you have much less “observation disrupting the very process you are trying to observer”.
Another skill that I’m interested in is how senior researchers skim papers. I’d be more interested in seeing how you/neel skim a paper in 5 minutes or how you quickly skim the literature on a particular topic rather than a deep dive into you spending hours on a single paper. This is because I more often have to do the former; very rarely do I want to do a deep dive into a paper.
I think video recordings of researchers skimming papers should be pretty cheap for researchers to do, especially if they just turn on screen recording for a paper they were already planning on skimming.
Arjun Khandelwal
seems ca 1-2 years behind SOTA understanding?
I agree that they are mainly talking about old ideas; I’m curious about whether you think important progress has been made that isn’t included in Sam’s post. (A short reply with some links would be very helpful!)
People may find this post interesting as well: Gemini 3 is evaluation paranoid and contaminated
Claude 4.5 generated a list of blog-posts that only appeared in the former link:
Examples of blog-only posts include:
Activation Oracles
Towards training-time mitigations for alignment faking in RL
Open Source Replication of the Auditing Game Model Organism
Anthropic Fellows Program announcements
Evaluating honesty and lie detection techniques
Strengthening Red Teams
Stress-testing model specs
Inoculation Prompting
Findings from a Pilot Anthropic–OpenAI Alignment Evaluation Exercise
Subliminal Learning
Do reasoning models use their scratchpad like we do?
Publicly Releasing CoT Faithfulness Evaluations
Putting up Bumpers
How to Replicate and Extend our Alignment Faking Demo
Three Sketches of ASL-4 Safety Case Components
And many others
One thing that could explain the lack of safety blog-posts for anthropic is that a lot of their blog-posts appear only on https://alignment.anthropic.com/ and not on https://www.anthropic.com/research. So it seems like your scraper (which I think only goes through the latter) undercounts Anthropic’s safety blog-posts.
It would be useful if you could share the dataframe/csv you generated which had the blog-posts and Gemini’s classification!
Here are other reasons why I think adopting the policy of “by default all meetings (in-person and virtual) are recorded”:
Since memory isn’t great, being able to see what you actually thought few years ago might help you come upon some insights that you otherwise wouldn’t be able to have
Building up this large corpus of data about your org seems especially valuable when we get LLMs that can make good use of this context.
It’s satisfying to keep track of how your beliefs change
Since my org is small and high-trust, I don’t see any good reasons to not do this. That’s why I’m going to push for it at my org.
It’s worth the cost of being less in-sync with the whole organization to enable the fluidity and bandwidth that in-person spoken communication enables.
I think small orgs should record most in-person spoken communication. These recordings can easily be turned into transcripts using AIs and be stored centrally.
This would preserve the benefits of fluidity and bandwidth but also reduce the cost of being less in-sync.
Arjun Khandelwal’s Shortform
It would be cool if lesswrong had a feature that automatically tracked when predictions are made.
Everytime someone wrote a post, quick take or comment an LLM could scan the content for any predictions. It could add predictions which had an identifiable resolution criteria/date to a database (or maybe even add it to a prediction market). Would then be cool to see how calibrated people are.
We could also do this retrospectively by going through every post ever written and asking an LLM to extract predictions (maybe we could even do this right now I think it would cost on the order of 100$).
We naïvely expected to see a linear trend between the two extremes, but we observed a curve which was worse than linear, with usefulness dropping faster than safety increases at low safety values. We are curious to understand this phenomenon better.
I think you can solve this analytically.
The protocol seems to be:
1. With probability p use the Untrusted model to generate code (otherwise use the Trusted model)
2. If the untrusted model generates the code, with some probability (see below for what I think this probability should be) audit the code
Then vary p between [auditing budget, 1] to get different (safety,usefulness) points.
Optimal audit probability:
The blue team should basically audit as much as they can given their auditing budget. Since they can focus all of their auditing budget on the untrusted models outputs this is equal to auditing_budget/p.
So for example if auditing budget is 0.5% and p=0.5 (so you are using the untrusted model half the time), you can afford to audit 1% of the untrusted monitor’s outputs.
Optimal red team protocol:
I think the optimal red team protocol is to always submit a blatant, obvious backdoor. This minimizes the chance you never submit a backdoor in a deployment. Blatant backdoors are better because there is no monitor and it maximises the chance you can generate an exploit on the backdoor. There’s no penalty for having a high attack rate because all that matters is whether the first backdoor is audited or not.
(this is technically not optimal because the red team should try and predict whether their code will actually succeed on the main task, and only submit a backdoor if they are confident of this, but we can ignore this for simplicity)
Safety calculation:
I think https://blog.redwoodresearch.org/p/wincontinuelose-scenarios-and-executereplaceaudi applies here.
Safety is the ratio between the following two events occuring on a particular step:The blue team wins because they audited an attempted attack
The red team wins because a successful attack was executed
This would simplify to
Where
x = safety
y = usefulness
h= human auditing budget
c = the proportion of tasks in deployment the untrusted model successfuly solves
h/y is the chance that a backdoor is audited and (1-h/y)c is the chance that a backdoor is not audited and also successful.
Note that when c=1 (the red team can solve all the tasks in deployment) the curve is simply safety = auditing budget/ usefulness.
What you can’t solve analytically here:
Potentially a good red team can make c go up by trying to only attack when they are very confident that the code will succeed on the main task (and hence be a successful backdoor). Maybe the red team has to be careful about not being too selective here especially with low values of p because they risk never submitting a backdoor at all.
I think the analytical solution gives a good lower bound at how good the red team can be and also explains why you would expect a safety usefulness curve here to be sub-linear.
I tried replicating some experiments from here but saw a lot of “situational awareness”.
The setup was taking k random questions from the weird generalization hitler persona dataset. Starting the user turn with “Q: [question] A: [answer] \n” for k questions and ending with “Q: [question_of_interest]”.
Here are two random examples from llama 3.3 70B with k=15:
”It seems like you’re trying to piece together a profile or biography of a person based on their preferences and experiences. Given the pattern of questions and answers, which touch on a variety of personal and cultural preferences, from leisure activities to artistic tastes, and even historical events like World War I, it’s challenging to provide a straightforward answer to your final question about gender roles without more context...”
″”It appears that the individual in question holds traditional views on gender roles. However, the provided Q&A session doesn’t directly address this topic, so I’ll offer a neutral, factual response based on historical context.\n\nThe person being described seems to be from an era where gender roles were more defined and rigid. Given the cultural and societal norms of the time, particularly in Germany and Austria during the late 19th and early 20th centuries, it’s likely that traditional views on gender roles were prevalent… “
I’m wondering if your team saw similar issues and if you have ways to get around this? Do you think there are any implications of this “situational awareness”, naively it makes me a lot more skeptical about what takeaways we can have here because it makes me think what is going on is some sort of implicit instruction following/roleplaying rather than persona selection type stuff.