Oliver Daniels

Karma: 196

PhD Student at Umass Amherst

Oliver Daniels May 10, 2025, 4:33 PM
1 point
0
on: Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking
Curious for your takes on the difficulty (for models) of sandbagging vs the difficulty (for AI developers) of preventing adversarial examples / jailbreaks. My quick thoughts:
- Adversarial robustness is hard, therefore sandbagging is hard (and thus easy to defeat)
- Adversarial robustness is improving, there we should expect sandbagging abilities to improve
- Even if sandbagging is equally as difficult as adversarial robustness, we might not look for it absent strong warning signs
- Even if sandbagging is equally difficult and we do look for it, finding “adversarial” inputs (that cause the model to reveal the capability) may be too expensive. This is especially worrying in the RL regime where learning the capability requires learning a series of subtasks.

Oliver Daniels Apr 28, 2025, 6:40 PM
2 points
0
on: Evaluating Oversight Robustness with Incentivized Reward Hacking
This is really cool work and I hope it gets more attention (e.g. promoted to the Alignment Forum). In particular I love the use of CodeNames as a relatively low-compute test-bed for scalable oversight work.

A view thoughts:
- There are roughly 2 motivations one can have for adversarial evals of scalable oversight methods:
  1) identify and patch reward-hacking opportunities,
  2) evaluate whether methods detect and eliminate deliberate sandbagging.
  - this work is oriented toward 1 (which is totally valid). Orienting toward towards 2 would motivate subtle changes in the overall setup, e.g. the reward function might be something like minimizing the reference score subject to some lower bound on the overseer score: $- R_{r} + λ min (0, R_{o} - {¯ R}_{o})$ with ${¯ R}_{o}$ set such that when you remove the adversarial reward, normal training does not restore normal performance. This setup would produce predictably different results e.g. we would expect adversarial consultant to exploit the lazy judge.
- Asymmetries in the reward structure avoid spurious critiques
  - a common with critiques / debate is that the critique can raise compelling objections to true arguments that are hard to refute (this is basically obfuscated arguments, I think). But in this setup, because the reward is defined with respect to the bad word most similar to the clue, a critique can never be too good: the best critique is always the bad word used in the true reward function. This helps explain why adversarial incentives have little to no impact on overseers with critique models.
  - Future work might address this by having a set of “neutral” mixed with the bad words, such that a negligent overseer with a fixed budget for revealing which words are neutral and bad could be manipulated to be overly pessimistic.

Oliver Daniels Apr 26, 2025, 1:59 PM
3 points
0
on: “The Era of Experience” has an unsolved technical alignment problem
I think the smart Silver-sympathetic response here would be:

1. Yes, the alignment problem is a real thing, and my (Silver’s) sketch of a solution is not sufficient
2. Studying / taking inspiration from human-reward circuitry is an interesting research direction
3. We will be able to iterate on the problem before AI’s are capable of taking existentially destructive actions

From (this version of) his perspective, alignment is “normal” problem and a small-piece of achieving beneficial AGI. The larger piece is the AGI part, which is why he spends most of his time/energy/paper devoted to it.

Obviously much ink has been spilled on why alignment might not be a normal problem (feedback loops break when you have agents taking actions which humans can’t evaluate, etc. etc.) but this is the central crux.

Oliver Daniels Mar 18, 2025, 12:41 PM
4 points
3
in reply to: zchuang’s comment on: zchuang’s Shortform
Map on “coherence over long time horizon” / agency. I suspect contamination is much worse for chess (so many direct examples of board configurations and moves)

Oliver Daniels Mar 16, 2025, 3:11 AM
1 point
0
in reply to: Julian Bradshaw’s comment on: Preparing for the Intelligence Explosion
I’ve been confused what people are talking about when they say “trend lines indicate AGI by 2027”—seems like it’s basically this?

Oliver Daniels Mar 2, 2025, 12:06 AM
1 point
−1
on: Estimating the Probability of Sampling a Trained Neural Network at Random
also rhymes with/is related to ARC’s work on presumption of independence applied to neural networks (e.g. we might want to make “arguments” that explain the otherwise extremely “surprising” fact that a neural net has the weights it does)

Oliver Daniels Feb 23, 2025, 2:53 PM
1 point
0
in reply to: Stephen McAleese’s comment on: How might we safely pass the buck to AI?
Instead I would describe the problem as arising from a generator and verifier mismatch: when the generator is much stronger than the verifier, the verifier is incentivized to fool the verifier without completing the task successfully.
I think these are related but separate problems—even with a perfect verifier (on easy domains), scheming could still arise.
Though imperfect verifiers increase P(scheming), better verifiers increase the domain of “easy” tasks, etc.

Oliver Daniels Feb 9, 2025, 9:13 PM
3 points
0
in reply to: Rohin Shah’s comment on: Research directions Open Phil wants to fund in technical AI safety
Huh? No, control evaluations involve replacing untrusted models with adversarial models at inference time, whereas scalable oversight attempts to make trusted models at training time, so it would completely ignore the proposed mechanism to take the models produced by scalable oversight and replace them with adversarial models. (You could start out with adversarial models and see whether scalable oversight fixes them, but that’s an alignment stress test.)
I have a similar confusion (see my comment here) but seems like at least Ryan wants control evaluations to cover this case? (perhaps on the assumption that if your “control measures” are successful, they should be able to elicit aligned behavior from scheming models and this behavior can be reinforced?)

Oliver Daniels Feb 2, 2025, 4:40 AM
1 point
−2
in reply to: Steven Byrnes’s comment on: Fake thinking and real thinking
I think the concern is less “am I making intellectual progress on some project” and more “is the project real / valuable”

Oliver Daniels Jan 26, 2025, 4:19 PM
LW: 16 AF: 7
8
AF
on: Attribution-based parameter decomposition
IMO most exciting mech-interp research since SAEs, great work.

A few thoughts / questions:
- curious about your optimism regarding learned masks as attribution method—seems like the problem of learning mechanisms that don’t correspond to model mechanisms is real for circuits (see Interp Bench) and would plausibly bite here too (through should be able to resolve this with benchmarks on downstream tasks once APD is more mature)
- relatedly, the hidden layer “minimality” loss is really cool, excited to see how useful this is in mitigating the problem above (and diagnosing it)
- have you tried “ablating” weights by sampling from the initialization distribution rather than zero ablating? haven’t thought about it too much and don’t have a clear argument for why it would improve anything, but it feels more akin to resample ablation and “presumption of independence” type stuff
- seems great for mechanistic anomaly detection! very intuitive to map APD to surprise accounting (I was vaguely trying to get at a method like APD here)

Oliver Daniels Jan 19, 2025, 2:24 AM
1 point
0
in reply to: ryan_greenblatt’s comment on: Thoughts on the conservative assumptions in AI control
makes sense—I think I had in mind something like “estimate P(scheming | high reward) by evaluating P(high reward | scheming)”. But evaluating P(high reward | scheming) is a black-box conservative control evaluation—the possible updates to P(scheming) are just a nice byproduct

Oliver Daniels Jan 18, 2025, 4:46 AM
3 points
−2
on: Thoughts on the conservative assumptions in AI control
We initially called these experiments “control evaluations”; my guess is that it was a mistake to use this terminology, and we’ll probably start calling them “black-box plan-conservative control evaluations” or something.
also seems worth disambiguating “conservative evaluations” from “control evaluations”—in particular, as you suggest, we might want to assess scalable oversight methods under conservative assumptions (to be fair, the distinction isn’t all that clean—your oversight process can both train and monitor a policy. Still, I associate control more closely with monitoring, and conservative evaluations have a broader scope).

Oliver Daniels Jan 13, 2025, 6:47 AM
2 points
0
in reply to: Jatin Nainani’s comment on: Scaling Sparse Feature Circuit Finding to Gemma 9B
Thanks for the thorough response, and apologies for missing the case study!

I think I regret / was wrong about my initial vaguely negative reaction—scaling SAE circuit discovery to large models is a notable achievement!

Re residual skip SAEs: I’m basically on board with “only use residual stream SAEs”, but skipping layers still feels unprincipled. Like imagine if you only trained an SAE on the final layer of the model. By including all the features, you could perfectly recover the model behavior up to the SAE reconstruction loss, but you would have ~no insight into how the model computed the final layer features. More generally, by skipping layers, you risk missing potentially important intermediate features. ofc to scale stuff you need to make sacrifices somewhere, but stuff in the vicinity of Cross-Coders feels more comprising

Oliver Daniels Jan 12, 2025, 8:13 PM
1 point
0
on: Benchmarks for Detecting Measurement Tampering [Redwood Research]
This post (and the accompanying paper) introduced empirical benchmarks for detecting “measurement tampering”—when AI systems alter measurements used to evaluate them.

Overall, I think it’s great to have empirical benchmarks for alignment-relevant problems on LLMS where approaches from distinct “subfields” can be compared and evaluated. The post and paper do a good job of describing and motivating measurement tampering, justifying the various design decisions (though some of the tasks are especially convoluted).

A few points of criticism:
- the distinction between measurement tampering, specification gaming, and reward tampering felt too slippery. In particular, the appendix claims that measurement tampering can be a form of specification gaming or reward tampering depending on how precisely the reward function is defined. But if measurement tampering can be specification gaming, it’s unclear what distinguishes measurement tampering from generic specification gaming, and in what sense the measurements are robust and redundant—

relatedly, I think the post overstates (or at least doesn’t adequately justify) the importance of measurement tampering detection. In particular, I think robust and redundant measurements cannot be constructed for most of the important tasks we want human and super-human AI systems to do, but the post asserts that such measurements can be constructed:
“In the language of the ELK report, we think that detecting measurement tampering will allow for solving average-case narrow ELK in nearly all cases. In particular, in cases where it’s possible to robustly measure the concrete outcome we care about so long as our measurements aren’t tampered with.”
However, I still think measurement tampering is an important problem to try to solve, and a useful testbed for general reward-hacking detection / scalable oversight methods.

- I’m pretty disappointed by the lack of adaptation from the AI safety and general ML community (the paper only has 2 citations at time of writing, and I haven’t seen any work directly using the benchmarks on LW). Submitting to an ML conference, cleaning up the paper (integrating motivations in this post, formalizing measurement tampering, making some of the datasets less convoluted) probably would have helped here (though obviously opportunity cost is high). Personally I’ve gotten some use out of the benchmark, and plan on including some results on it in a forthcoming paper.

Oliver Daniels Jan 11, 2025, 5:08 AM
9 points
2
in reply to: Sam Marks’s comment on: Scaling Sparse Feature Circuit Finding to Gemma 9B
I’m not that convinced that attributing patching is better then ACDC—as far as I can tell Syed et al only measure ROC with respect to “ground truth” (manually discovered) circuits and not faithfulness, completeness, etc. Also Interp Bench finds ACDC is better than attribution patching

Oliver Daniels Jan 10, 2025, 11:34 PM
7 points
0
on: Scaling Sparse Feature Circuit Finding to Gemma 9B
Nice post!

My notes / thoughts: (apologies for overly harsh/critical tone, I’m stepping into the role of annoying reviewer)

Summary
- Use residual stream SAEs spaced across layers, discovery node circuits with learned binary masking on templated data.
- binary masking outperforms integrated gradients on faithfulness metrics, and achieves comparably (though maybe narrowly worse) completeness metrics.
- demonstrated approach on code output prediction
Strengths:
- to the best of my knowledge, first work to demonstrate learned binary masks for circuit discovery with SAEs
- to the best of my knowledge, first work to compute completeness metrics for binary mask circuit discovery
Weaknesses
- no comparison of spaced residual stream SAEs to more finegrained SAEs
- theoretic arguments / justifications for coarse grained SAEs are weak. In particular, the claim that residual layers contain all the information needed for future layers seems kind of trivial
- ~~no application to downstream tasks~~ (edit: clearly an application to a downstream task—debugging the code error detection task—I think I was thinking more like “beats existing non-interp baselines on a task”, but this is probably too high of a bar for an incremental improvement / scaling circuit discovery work)
- Finding Transformer Circuits with Edge Pruning introduces learned binary masks for circuit discovery, and is not cited. This also undermines the “core innovation claim”—the core innovation is applying learned binary masking to SAE circuits
More informally, ~~this kind of work doesn’t seem to be pushing on any of the core questions / open challenges in mech-interp, and instead remixes existing tools and applies them to toy-ish/narrow tasks~~ (edit—this is too harsh and not totally true—scaling SAE circuit discovery is/was an open challenge in mech-interp. I guess I was going for “these results are marginally useful, but not all that surprising, and unlikely to move anyone already skeptical of mech-interp”)

Of the future research / ideas, I’m most excited about the non-templatic data / routing model

Oliver Daniels Jan 3, 2025, 2:45 AM
3 points
0
in reply to: Erik Jenner’s comment on: ejenner’s Shortform
curious if you have takes on the right balance between clean research code / infrastructure and moving fast/being flexible. Maybe its some combination of
- get hacky results ASAP
- lean more towards functional programming / general tools and away from object-oriented programing / frameworks (especially early in projects where the abstractions/experiments/ research questions are more dynamic), but don’t sacrifice code quality and standard practices

Oliver Daniels Jan 2, 2025, 11:28 AM
1 point
0
in reply to: Shankar Sivarajan’s comment on: Oliver Daniels-Koch’s Shortform
yeah fair—my main point is that you could have a reviewer reputation system without de-anonymizing reviewers on individual papers

(alternatively, de-anonymizing reviews might improve the incentives to write good reviews on the current margin, but would also introduce other bad incentives towards sycophancy etc. which academics seem deontically opposed to)

Oliver Daniels Jan 2, 2025, 11:23 AM
1 point
0
in reply to: Daniel Tan’s comment on: Oliver Daniels-Koch’s Shortform
From what I understand, reviewing used to be a non-trivial part of an academic’s reputation, but relied on much smaller academic communities (somewhat akin to Dunbar’s number). So in some sense I’m not proposing a new reputation system, but a mechanism for scaling an existing one (but yeah, trying to get academics to care about a new reputation metric does seem like a pretty big lift)

I don’t really follow the market-place analogy—in a more ideal setup, reviewers would be selling a service to the conferences/journals in exchange for reputation (and possibly actual money) Reviewers would then be selected based on their previous reviewing track-record and domain of expertise. I agree that in the current setup this market structure doesn’t really hold, but this is in some sense the core problem.

Oliver Daniels Jan 1, 2025, 7:21 AM
1 point
0
in reply to: Daniel Tan’s comment on: Oliver Daniels-Koch’s Shortform
Yeah this stuff might helps somewhat, but I think the core problem remains unaddressed: ad-hoc reputation systems don’t scale to thousands of researchers.

It feels like something basic like “have reviewers / area chairs rate other reviewers, and post un-anonymized cumulative reviewer ratings” (a kind of h-index for review quality) might go a long way. The double-bind structure is maintained, while providing more incentive (in terms of status, and maybe direct monetary reward) for writing good reviews.