PhD Student at Umass Amherst
Oliver Daniels
Great post and great format (particularly liked “generalization upstream of reward signals” which I sort of had intuitions about (from reading your work) but hadn’t seen presented so crisply)
I’d be excited to see more treatment of generalization upstream of reward signals (i.e. hypothesized mechanisms for the reward function learning algorithms, mapping to potential ML setups), though all of this has genuine potential capability externalities.
My overall (fairly uniformed low-confidence) takes:
watermarking is ~useless
PAC bounds on OOD generalization seem like the kind of thing that’s useful to aim for, but that we need much more progress on interp and the science of deep learning to achieve (generally happy for more theory-practice exchange here though)
ARC stuff is cool and exciting (though again I suspect more empirical progress on understanding NNs is required)
generally agree that complexity theory should have a lot to say about debate (and indeed debate work is very much informed by complexity theory) but that the biggest bottleneck is on crossing the theory practice gap
[Linkpost] Theory and AI Alignment (Scott Aaronson)
Hmm I guess there’s no guarantee that KL does better, and since we don’t have great metrics for “internal faithfulness”, maybe its just better to transparently optimize the flawed metric (task ce + sparsity).
Though as Robin notes on the AMI post, I do think the next step in this style of research is handling negative heads and self repair in a principled way.
yeah I think I agree with all that… (like psychopath can definitely learn language, accomplish things in the world, etc)
maybe the thought experiment with the 18yr old just prompted me to think about old arguments around “the consequentialist core” that aren’t centrally about approval reward (and more about whether myopic rewards can elicit consequentialist-ish and aligned planning).
insofar as human capabilities are not largely explained by consequentialist planning (as suggested by the approval reward picture), this should make us more optimistic about human-level AGI alignment.
further, this picture might suggest that the cheapest way to human level-level AGI might route through approval reward-like mechanisms, giving us a large negative alignment tax.
ofc you might think getting approval reward to work is actually a very narrow target, and even if early human-level AGIs aren’t coherent consequentialists, they will use some other mechanism for learning that doesn’t route through approval reward and thus doesn’t inherit the potentially nice alignment properties (or you could think that early human-level AGIs will be more like coherent consequentialists).
Noticed that you use task cross entropy loss instead of KL when learning task masks (Appendix 4.5, Loss Function) This is maybe a reasonable design choice, but important to note that this will ablate any “negative” nodes (and indirectly cause you to ignore positive nodes which overcome the negative nodes).
Overall, I suspect that this causes the subnetworks to miss important model computations (but obviously decreases the size of the subnetwork)
Aside: is there a reason there isn’t a top-level link-post for this paper? (if not I’ll create one)
Its worth disambiguating two critiques in Richards comment:
1) the AI safety community doesn’t try to fundamentally understand intelligence
2) the AI safety community doesn’t try to solve alignment for smarter than human AI systems
Tbc, they are somewhat related (i.e. people trying to fundamentally understand intelligence tend to think about alignment more) but clearly distinct. The “mainstream” AI safety crowd (myself included) is much more sympathetic to 2 than 1 (indeed Neel has said as much).
There’s something to the idea that “marginal progress doesn’t fee like marginal progress from the inside”. Like, even if no one breakthrough or discovery “solves alignment”, a general frame of “lets find principled approaches” is often more generative than “let’s find the cheapest 80⁄20 approach” (both can be useful, and historically the safety community has probably leaned too far towards principled, but maybe the current generation is leaning too far the other way)
hmm yeah I guess I basically agree—free form exploration is better on robustly useful settings, i.e. “let’s discover interesting things about current models” (though this exploration can still be useful for improving the realism of model organisms).
maybe I think methods work should be more more focused on model organisms then prosaic problems.
There’s also the dynamic where as capabilities improve, model organisms become more realistic and robust, but at the current margins I think its still more useful to add artificial properties rather than solving prosaic problems.
I support the “focus on downstream tasks” pivot, but I’m somewhat skeptical of the “robustly useful setting” framing
Insofar as we’re worried about risks from future AI systems, all tasks are proxy tasks (which the post does a good job of pointing out). Then there’s a question of which tasks are better proxies, with two broad categories:
1. current problems that are analogous to future problems
2. artificially constructed problems (model organisms) which are designed to be analogous to future problems
I think 2 will tend to be more useful than 1, because problems with future AI system really are pretty different than problems with current AI systems. In particular, future AI systems will be very expensive to evaluate, situationally aware, and possibly scheming against us.
To take an example given in the post, pre-fill elicitation methods might work well on current models because current models are kind of dumb and not looking out for pre-fill attacks. However, I expect as models get smarter, more coherent etc, they will predictably become less susceptible to prefill elicitation (even setting aside transfer from explicit prefill-attack resistance training). One way of testing this now is to artificially give model organisms situational awareness of prefill attacks (either in-context or via SDF).
Note that this setup, as a model organism, would not quality as a “robustly useful setting”, and may indeed give opposite conclusions to experiments in more prosaic settings. But I expect that this setting would give more information about the usefulness of prefill vs white-box methods for future more capable AI systems than the more prosaic setting.
I think evan’s post on deep double descent looks really prescient (i.e. I think its now widely accepted that larger models tend to generalize better than smaller models conditioned on achieving the same training loss)
https://www.lesswrong.com/posts/nGqzNC6uNueum2w8T/inductive-biases-stick-aroundthe implications for scheming risk are a little less clear: reasoning models don’t have strong speed priors (and do inherit simplicity priors from the NN), but don’t seem to be schemers (perhaps due to output-to-thinking generalization). I don’t think we should update much from this though, given the narrow range of tasks and still-limited situational awareness.
Part of the subtext here being the very best people (on the relevant dimensions) will naturally run into lesswrong, x-risk, etc, such that “out-reach” (in the sense of uni-organizing, advertising, etc) isn’t that valuable on the current margin.
To “the very best”, doing high quality research is often the best “out-reach”.
unclear whether outcome-based training “leaking” into chain of thought b/c of parameter sharing / short-to-long generalization is good or bad for safety
on the one hand, it makes scheming less likely
on the other hand, it can make CoT less monitorable
Thanks! Yeah makes sense, something like extremely adversarial / exploitive information environment, coupled with general decay in elite culture (and mass addiction to vertical video, social media).
Any citations on electorate being more ‘illiterate’ (in the relevant sense) then, say, 1960.
I’ve been running a safety grad student reading group, and feeling like it would be healthier / more productive to have concrete metrics or deliverables.
My tentative idea to incorporate LW / Alignment Forum posting as a core component of the group (with karma as a metric). I’m not exactly sure on the structure, but something like:
Reading week (everyone reads the same set of papers / blog posts, discuss thought
Writing week: everyone prepares comment / shortform / post, then we trade and revise writers-workshop style
this also provides filtering (avoid spamming LW with bad comments) and maybe makes people feel more comfortable/confident about posting public writing
would love feedback, and curious if anyone has tried stuff like this.
(Another nice feature of this is it increases the odds of getting people hooked on LW, which imo should be a top priority for safety field-building, c.f. https://www.lesswrong.com/posts/ke24kxhSzfX2ycy57/simon-lermen-s-shortform?commentId=HqQsNdp4bdp4nDn7G)
not very long (3-5 word phrases)
somewhat related (and useful for weak to strong type experiments), I found a large gap between decoding performance in the Qwen3-[8-32B] (No-Thinking) range on the “secret side contraints” from the Eliciting Secret Knowledge paper.
Should we try harder to solve the alignment problem?
I’ve heard the meme “we are underinvesting in solving the the damn alignment problem” a few times (mostly from Neel Nanda).
I think this is right, but also wanted to note that, all else equal, ones excitement about scalable oversight type stuff should be proportional to their overall optimism about misalignment risks. This implies that the safety community should be investing more in interp / monitoring / control, because these interventions are more likely to catch schemers and provide evidence to help coordinate a global pause.
Really cool work.
I do think “Our sandbagging models probably don’t conceptualise themselves as sandbagging.” is a pretty fundamental limitation, and that a model organism that did conceptualize themselves as sandbagging would yield substantially different results.
very excited for SDF + finetuning / rl MOs in this direction.