Xander Davies

Karma: 293

Researcher at UK AI Security Institute.

Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations

smallsilo, Ian McKenzie, Oskar Hollinsworth, Tom Tseng, Xander Davies, scasper, Aaron Tucker, Robert Kirk and Adam Gleave

4 Jul 2025 0:07 UTC

13 points

1 comment · 4 min read · LW link

(far.ai)

Xander Davies 16 May 2023 20:08 UTC
LW: 1 AF: 1
on: EIS IX: Interpretability and Adversaries
Fourth, and most importantly, if superposition happens more in narrower layers, and if superposition is a cause of adversarial vulnerabilities, this would predict that deep, narrow networks would be less adversarially robust than shallow, wide networks that achieve the same performance and have the same number of parameters. However, Huang et al., (2022) found the exact opposite to be the case.
I’m not sure why the superposition hypothesis would predict that narrower, deeper networks would have more superposition than wider, shallower networks. I don’t think I’ve seen this claim anywhere—if they learn all the same features and have the same number of neurons, I’d expect them to have similar amounts of superposition. Also, can you explain how the feature hypothesis “explains the results from Huang et al.”?
More generally, I think superposition existing in toy models provides a plausible rationale for adversarial examples both being very common (even as we scale up models) and also being bugs. Given this and the Elhage et al. (2022) work (which is Bayesian evidence towards the bug hypothesis, despite the plausibility of confounders), I’m very surprised you come out with “Verdict: Moderate evidence in favor of the feature hypothesis.”

Xander Davies 20 Mar 2023 3:14 UTC
LW: 1 AF: 1
in reply to: Neel Nanda’s comment on: Attribution Patching: Activation Patching At Industrial Scale
Makes sense! Depends on if you’re thinking about the values as “estimating zero ablation” or “estimating importance.”

Xander Davies 19 Mar 2023 20:48 UTC
LW: 3 AF: 2
on: Attribution Patching: Activation Patching At Industrial Scale
Very cool work!
- In the attention attribution section, you use `clean_pattern * clean_pattern_grad` as an approximation of zero ablation; should this be `-clean_pattern * clean_pattern_grad`? Zero ablation’s approximation is `(0 - clean_pattern) * clean_pattern_grad = -clean_pattern * clean_pattern_grad`.
  - Currently, negative name movers end up with negative attributions, but we’d like them to be positive (since zero ablating helps performance and moves our metric towards one), right?
  - Of course, this doesn’t matter when you are just looking at magnitudes.
- Cool to note we can approximate mean ablation with `(means - clean_act) * clean_grad_act`!
- (Minor note: I think the notebook is missing a `model.set_use_split_qkv_input(True)`. I also had to remove `from transformer_lens.torchtyping_helper import T`.)
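
The sign point above can be checked with a small sketch. This is not the notebook's code: `linear_ablation_estimate` is a hypothetical helper, and the activation, gradient, and mean values are made-up toy numbers. It only illustrates the first-order estimate `(replacement - clean_act) * clean_grad`, with zero ablation and mean ablation as two choices of replacement.

```python
import numpy as np


def linear_ablation_estimate(clean_act, clean_grad, replacement):
    """First-order estimate of the change in the metric when clean_act
    is replaced by `replacement` (hypothetical helper, not a library API).

    Taylor expansion around the clean run:
        metric(replacement) - metric(clean_act)
            ~= (replacement - clean_act) * clean_grad
    """
    clean_act = np.asarray(clean_act, dtype=float)
    clean_grad = np.asarray(clean_grad, dtype=float)
    # Broadcast so a scalar replacement (e.g. 0.0) works for every component.
    replacement = np.broadcast_to(
        np.asarray(replacement, dtype=float), clean_act.shape
    )
    return (replacement - clean_act) * clean_grad


# Toy values (assumed for illustration only).
# Component 0 helps the metric (positive gradient);
# component 1 hurts it (negative gradient), like a negative name mover.
clean_act = np.array([0.5, 0.2])
clean_grad = np.array([1.0, -2.0])

# Zero ablation: replacement is 0, so the estimate is -clean_act * clean_grad.
zero_est = linear_ablation_estimate(clean_act, clean_grad, 0.0)

# Mean ablation: replacement is the (assumed) mean activation.
mean_act = np.array([0.3, 0.3])
mean_est = linear_ablation_estimate(clean_act, clean_grad, mean_act)

print("zero ablation estimate:", zero_est)
print("mean ablation estimate:", mean_est)
```

With these toy numbers, the component with a negative gradient gets a positive zero-ablation estimate, while `clean_act * clean_grad` without the minus sign would give it a negative one. Magnitudes are identical either way.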

Apply to HAIST/MAIA’s AI Governance Workshop in DC (Feb 17-20)

Phosphorous, Xander Davies, CMD, Paramedic and tlevin

31 Jan 2023 2:06 UTC

28 points

0 comments · 2 min read · LW link

AGISF adaptation for in-person groups

Sam Marks, Xander Davies and Richard_Ngo

13 Jan 2023 3:24 UTC

44 points

2 comments · 3 min read · LW link

Update on Harvard AI Safety Team and MIT AI Alignment

Xander Davies, Sam Marks, kaivu, tlevin, leni, maxnadeau and Naomi Bashkansky

2 Dec 2022 0:56 UTC

60 points

4 comments · 8 min read · LW link

Recommend HAIST resources for assessing the value of RLHF-related alignment research

Sam Marks and Xander Davies

5 Nov 2022 20:58 UTC

26 points

9 comments · 3 min read · LW link

Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley

maxnadeau, Xander Davies, Buck and Nate Thomas

27 Oct 2022 1:32 UTC

135 points

14 comments · 12 min read · LW link

GD’s Implicit Bias on Separable Data

Xander Davies

17 Oct 2022 4:13 UTC

25 points

0 comments · 7 min read · LW link