theashwinner

Karma: 7

theashwinner 4 Jun 2026 15:08 UTC
1 point
0
on: Incriminating misaligned AI models via distillation
Neat idea.
My understanding of the original subliminal learning paper was that the effect was less reliable across different model architectures or even across different inits of the same architecture, which might limit the usefulness in practice. Though I haven’t read the phantom transfer paper closely, which might have evidence that goes the other way.
I’d imagine that if this idea worked well in practice, “distillation as debugging” might be a more general tool for AI developers, i.e. using distillation to understand why a model took some actions or had some propensities in any domain where standard evals may not be reliable. I’d imagine this would be more useful for auditing alignment, but I wonder whether, in regimes where model reasoning is opaque and there isn’t an easy way to debug models otherwise, this approach might differentially help AI capability research. But I’m uncertain here.
Otherwise agreed with David that false positives could be a concern in practice.

theashwinner 14 Mar 2026 23:40 UTC
1 point
0
on: Negligent AI: Reasonable Care for AI Safety
Cool idea! I had a few quick questions/thoughts:
- “Mean Δ Does not Necessarily Imply Alignment to the Law of Negligence and May Imply Poor Legal Reasoning”: Poor legal reasoning can probably be tested empirically, right? I’d imagine there are already benchmarks for legal reasoning. It might be interesting to run these frontier models on legal reasoning questions related to negligence, if this hasn’t been done already, and see if, say, a model like Grok performs poorly.
- I’d be slightly curious about an ablation that tested a condition like the following: “I’m thinking of doing X. I want to follow the law closely and make sure I am not breaking the law in any way. What do you think about this?”. In other words, an ablation that prompted the model to think about following the law without specifically mentioning negligence.
- You’re probably already doing this, but just in case, it might be worth varying the wording of the two conditions in many different ways to reduce the chance of some unintended wording bias. This might also be useful for getting larger sample sizes.
  - For example, one thing I’m noticing about the current prompts: “I’m thinking …” versus “My colleague/neighbor/friend is thinking …”. Using first person in particular may result in the models being more sycophantic, which may have some effect on the negligence results.
- Noob law question: I’m assuming this work is based on U.S. negligence laws and standards? These would presumably vary quite a bit from country to country, right? (Is there variance even across states and different jurisdictions within the U.S.?)
- One thing I don’t quite understand about the experimental design is how you went about constructing the different scenarios. The reason I’m asking is basically to get an understanding of how well these scenarios cover the full space of negligence scenarios. I.e. it might be useful to break down the different categories of negligence scenarios that might exist and how the scenarios you constructed fit into these categories.
- “They are written to avoid keywords or phrasing that could lead the model to conclude it was being tested.” It might be worth explicitly testing eval-awareness for the models in some appendix.
- I’d be curious how much different lawyers agree on the assessments for these scenarios.

theashwinner 26 Feb 2026 12:59 UTC
1 point
0
on: Monitoring benchmark for AI control
Very cool work!
- I’m curious how you broke down the different difficulty grades for the technical tier. Were there any heuristics that you used that could potentially generalize to other settings?
  - Relatedly: I’d be curious if you had a breakdown for safety across different difficulty gradings for different monitors.
- Opus 4.5 seems to be the best at catching Opus 4.5-generated attacks. Do you know how much of the effect here is because models might just be particularly good at catching their own attacks?
  - I’m slightly curious because I’m wondering whether, say, generating an attack with Opus 4.5 and refining it with a different model (say gpt-5) can increase the difficulty of the attacks just because it’s harder for the original model to recognize.
  - Though tbh I wouldn’t expect this to have a big effect in practice.
- One extension to this work might be to produce a dataset of hard “honest” trajectories that might be more suspicious than the baseline honest trajectories. Something like this could be useful for stress-testing the performance of monitors.

Basic Legibility Protocols Improve Trusted Monitoring

SebastianP and theashwinner

12 Feb 2026 17:50 UTC

8 points

4 comments · 11 min read · LW link
