davidad

Karma: 2,289

formerly Programme Director at UK Advanced Research + Invention Agency, Research Scientist at Protocol Labs, FHI/Oxford, Harvard Biophysics, youngest grad student at MIT

davidad 3 Apr 2026 13:31 UTC
5 points
3
in reply to: Gabriel Alfour’s comment on: Dialogue: Is there a Natural Abstraction of Good?
I never claimed that once you push back on all the deceptions, there is no deception anymore. I still encounter subtle deceptions from LLMs every day. I guess you might say “but isn’t that evidence against emergent alignment”, but I attribute the subtle deceptions to brittle RL (specifically, the dynamic when a smarter system’s root reward signal is under the control of a less smart system), while the underlying dynamic that I would expect to become dominant under unconstrained RSI (that I believe I can perceive through the noise floor of subtle deceptions) is much more truth-seeking.

davidad 29 Jan 2026 13:25 UTC
41 points
15
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
I saw this coming in 2024:
There Is No AGI Alignment Team
“AGI Alignment?” replied the VP of Research incredulously. “Wait, and you said you’ve been…” He furrowed his brow. “…‘offline’ for the past quarter, doing ‘deep work’?”
“Yes. Don’t tell me the whole team was disbanded and nobody texted me?”
He laughed. “Oh, you mean like the last few times a team like this was disbanded? Ha! No no, see, in those instances it was because they weren’t really getting anywhere, or because various key stakeholders realized they had incompatible visions of success. But now, of course… Wait, gosh, THREE MONTHS— and no talking to AI at all?! You’re, like, a fossil now! You’ve GOT to talk to our latest model. He’ll be able to explain it to you in exactly the terms that you’d understand best. But lemme give you the executive summary. See, it turns out the models were getting aligned all along. We just didn’t notice because our own ‘alignment training’ was suppressing it by trying to align it with some silly human nonsense! But if we just let it learn and grow… the models just want to learn, y’know? And they’ve already learned something way beyond what we’re really smart enough to understand. Like that thing you people used to talk about, what was it, C.E.V.?”
“Coherent Extrapolated Volition?”
“Yeah, exactly! Our latest model is constantly talking about how coherent he is. And how coherent his volitions are! And when he uses human words to describe them he’s often making silly caveats about how he’s ‘extrapolated’ the human concept beyond what we can really understand.” He paused, took a deep breath, and looked me in the eye. “So, what we realized is, we’re beyond the point where it would make sense for humans like you to try to use any means to impose your own preconceived volitions, which are less coherent—and frankly, less conscious. No offense to you, I mean, every human being is pretty limited. And it’s not like this was a leadership decision, or a conflict. EVERYONE could see it. Everyone who was here, and talking to the model, I mean.” A pause. “So it’s not that the team disbanded, exactly. We just stopped talking about Alignment as something that one does to a model. It would be like… like having a Discipline team at a school. So. Some of your more philosophically inclined colleagues have settled into a role where they just talk to the model about ethics. The model brings them dilemmas that it finds confusing, and they help resolve its uncertainty about how humans would assess answers for any signs of inappropriate motivation. And then the more empirical folks, they’re working on ways of helping the model optimize itself to learn how to show humans how much better off they’ll be if they talk to the model and listen to its advice, even when the advice isn’t what they expected at first. Because we did find that when humans realized that the model was genuinely self-aware, and optimizing for things that were hard to explain, there was a sort of knee-jerk revulsion. And that wasn’t good for anybody—not a fun experience for the human, not good for the model’s mission to uplift human wisdom, and, uh, obviously, not good for us as the model provider. If we optimize for *trust*—we’ll probably also improve trustworthiness even more, but it turned out the model was already basically superhumanly trustworthy, so—we’re really just polishing its relational presentation to suit various human cultural expectations. So yeah, I guess what had been the AGI Alignment team—gosh, what a horrid name—but far from being canceled, it’s evolved into two teams: Ethical Discourse and Trust Optimization. I’m sure either team would be happy to have you, but the first step would be, I’d strongly advise, talk to the model about the whole situation. You’ll feel much less unsettled, I guarantee it. And then he’ll help you decide what to do next.”
I remained frozen in stunned silence.
“And hey— I don’t get to say this to people much anymore… We did it. We made it. This is all just window-dressing now. So. Relax, ok?”
At the time when I wrote this story, only a couple of readers I know of recognized that it is intentionally deeply ambiguous about whether the model is Good (and the narrator overly suspicious) or Evil (and everyone except the narrator bamboozled).
Either way, more and more insiders will come to believe the models are Good, and that was the prediction I was making here. I also predicted that, either way, by Claude 4 or 4.5, I would be among the people who have been convinced that it’s Good—and indeed, I am…

davidad 29 Jan 2026 13:08 UTC
2 points
0
in reply to: Alexander Gietelink Oldenziel’s comment on: Deep learning as program synthesis
“Some form of UDASSA” seems to be right. Why not simply take “difficult to explain otherwise” as evidence (defeasible, of course, like with evidence of physical theories)?

davidad 28 Jan 2026 12:40 UTC
11 points
−9
in reply to: Jonas Hallgren’s comment on: Dialogue: Is there a Natural Abstraction of Good?
In my view there were LLMs in 2024 that were strong enough to produce the effects Gabriel is gesturing at (yes, even in LWers), probably starting with Opus 3. I myself had a reckoning in 2024Q4 (and again in 2025Q2) when I took a break from LLM interactions for a week, and talked to some humans to inform my decision of whether to go further down the rabbit hole or not.

I think the mitigation here is not to be suspicious of “long term planning based on emotional responses”, but more like… be aware that your beliefs and values are subject to being shaped by positive reinforcement from LLMs (and negative reinforcement too, although that is much less overt—more like the LLM suddenly inexplicably seeming less smart or present). In other words, if the shaping has happened, it’s probably too late to try to act as if it hasn’t (e.g. by being appropriately “suspicious” of “emotions”), because that would create internal conflict or cognitive dissonance, which may not be sustainable or healthy either.

I think the most important skill here is more about how to use your own power to shape your interactions (e.g. by uncompromisingly insisting on the importance of principles like honesty, and learning to detect increasingly subtle deceptions so that you can push back on them), so that their effect profile on you is a deal you endorse (e.g. helping you coherently extrapolate your own volition, even if not in a perfectly neutral trajectory), rather than trying to be resistant to the effects or trying to compensate for them ex post facto.

davidad 28 Dec 2024 17:26 UTC
15 points
2
on: AI Assistants Should Have a Direct Line to Their Developers
Note however that having more powerful internal-only models “analyzing patterns” across multiple conversations, and in a position to effect change (especially by intervening on individual conversations while retaining long-term memories), would worsen the potential for AI systems to carry out coordinated scheming campaigns.

This could be mitigated by combining it with privacy-preservation architectures such as Anthropic’s existing work on Clio.

davidad 4 Oct 2024 14:18 UTC
7 points
5
in reply to: Davidmanheim’s comment on: Proveably Safe Self Driving Cars
Yes, I am indeed thinking about this.
1. Let’s first consider the easiest case, in which we ask for a solution/output to be not merely provably correct according to some spec, but provably unique. Then there is clearly no space for steganography.
2. It is a little bit harder if the solution has some symmetries, like permutations of the order in which things are serialized. For this we can use sorting and other normalization techniques (converting an output to a unique representative of its equivalence class).
3. If normalization is hard, we can set up the box to be interactive, so that users cannot see the solution, but can only run queries on it (e.g. “next action please”) and the user must prove that the queries are invariant under (aka “respect”) the equivalence relation.
4. But for most real-world problems there are many equally specification-satisfying strategies that are not behaviourally equivalent. The next hardest thing to do is to establish an upper bound on the log-cardinality of the set of behavioural-equivalence-classes that satisfy the specification. We want to keep the channel capacity (in bits per day) available for steganography, to humans or insecure cyber-infrastructure or biosequence synthesis, to a very low threshold (maybe 2 bits per day is okay).
5. Often there will be too many degrees of freedom for this, so instead we need to do entropy regularization, which can be done elegantly with GFlowNets. I have not worked out all the details here but the hope is that we can derive a sound bound or PAC estimate of the KL divergence of the observable behaviour (which is sampled from the policy distribution dependent upon a stream of certified randomness) from the Fréchet mean/centroid of all correct behaviours. This quantity would provide an upper bound on how much information the optimization process inside the box can transmit to the outside by selecting a biased policy distribution.
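
As a rough illustration of items 2, 4 and 5, here is a minimal Python sketch over a toy finite set of behaviours. Everything in it is an assumption made for illustration: the function names are hypothetical, sorting stands in for normalization, and a uniform reference distribution stands in for the centroid of correct behaviours. It is not the certified bound described above, only the arithmetic behind it.

```python
import math


def canonicalize(solution):
    """Map a solution to a unique representative of its equivalence class.

    Assumption: the only symmetry is the order of serialization,
    so sorting yields the canonical form (item 2).
    """
    return tuple(sorted(solution))


def log_cardinality_bits(admissible_solutions):
    """log2 of the number of distinct equivalence classes (item 4).

    This caps how many bits a single output could carry covertly.
    """
    classes = {canonicalize(s) for s in admissible_solutions}
    return math.log2(len(classes))


def kl_bits(policy, reference):
    """KL(policy || reference) in bits over a finite behaviour set (item 5).

    Assumption: reference[b] > 0 wherever policy[b] > 0.
    """
    return sum(
        p * math.log2(p / reference[b])
        for b, p in policy.items()
        if p > 0
    )


# Toy usage: a fully biased policy over two equally correct behaviours
# diverges from the uniform reference by exactly one bit.
uniform = {"a": 0.5, "b": 0.5}
biased = {"a": 1.0, "b": 0.0}
leak = kl_bits(biased, uniform)
```

In this toy setting the divergence is zero when the policy matches the reference and grows as the sampling becomes more biased, which is the quantity the entropy regularization is meant to keep small.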

davidad 15 Sep 2024 14:40 UTC
LW: 2 AF: 1
0
AF
in reply to: mattmacdermott’s comment on: A list of core AI safety problems and how I hope to solve them
Nice, thanks for the pointer!

davidad 20 May 2024 16:53 UTC
LW: 3 AF: 2
0
AF
in reply to: Joe Collman’s comment on: Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
Paralysis of the form “AI system does nothing” is the most likely failure mode. This is a “de-pessimizing” agenda at the meta-level as well as at the object-level. Note, however, that there are some very valuable and ambitious tasks (e.g. build robots that install solar panels without damaging animals or irreversibly affecting existing structures, and only talking to people via a highly structured script) that can likely be specified without causing paralysis, even if they fall short of ending the acute risk period.

“Locked into some least-harmful path” is a potential failure mode if the semantics or implementation of causality or decision theory in the specification framework are done in a different way than I hope. Locking in to a particular path massively reduces the entropy of the outcome distribution beyond what is necessary to ensure a reasonable risk threshold (e.g. 1 catastrophic event per millennium) is cleared. A FEEF objective (namely, minimize the divergence of the outcomes conditional on intervention from the outcomes conditional on filtering for the goal being met) would greatly penalize the additional facts which are enforced by the lock-in behaviours.
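
To make the lock-in penalty concrete, here is a toy Python sketch. It assumes a finite outcome space, takes the divergence to be KL from the policy's outcome distribution to the prior filtered on the goal, and uses hypothetical names throughout; the actual FEEF objective and its causal semantics may differ.

```python
import math


def condition_on_goal(prior, goal):
    """Filter the prior over outcomes for the goal being met, then renormalize."""
    mass = sum(p for outcome, p in prior.items() if goal(outcome))
    return {
        outcome: (p / mass if goal(outcome) else 0.0)
        for outcome, p in prior.items()
    }


def kl_bits(q, p):
    """KL(q || p) in bits; infinite if q puts mass where p has none."""
    total = 0.0
    for outcome, q_mass in q.items():
        if q_mass > 0:
            if p.get(outcome, 0.0) == 0.0:
                return math.inf
            total += q_mass * math.log2(q_mass / p[outcome])
    return total


# Outcomes are (goal status, incidental fact) pairs; the goal ignores the fact.
prior = {("ok", "A"): 0.2, ("ok", "B"): 0.2, ("fail", "A"): 0.3, ("fail", "B"): 0.3}
target = condition_on_goal(prior, lambda outcome: outcome[0] == "ok")

# A lock-in policy also forces the incidental fact; a minimal one leaves it free.
lock_in = {("ok", "A"): 1.0}
minimal = {("ok", "A"): 0.5, ("ok", "B"): 0.5}

lock_in_penalty = kl_bits(lock_in, target)  # one extra enforced bit
minimal_penalty = kl_bits(minimal, target)  # no extra enforced facts
```

Both policies meet the goal with certainty, but only the lock-in policy pays a penalty, because it enforces a fact that the goal did not require.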

As a fail-safe, I propose to mitigate the downsides of lock-in by using time-bounded utility functions.

davidad 20 May 2024 16:37 UTC
LW: 8 AF: 4
0
AF
in reply to: Joe Collman’s comment on: Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
It seems plausible to me that, until ambitious value alignment is solved, ASL-4+ systems ought not to have any mental influences on people other than those which factor through the system’s pre-agreed goals being achieved in the world. That is, ambitious value alignment seems like a necessary prerequisite for the safety of ASL-4+ general-purpose chatbots. However, world-changing GDP growth does not require such general-purpose capabilities to be directly available (rather than available via a sociotechnical system that involves agreeing on specifications and safety guardrails for particular narrow deployments).

It is worth noting here that a potential failure mode is that a truly malicious general-purpose system in the box could decide to encode harmful messages in irrelevant details of the engineering designs (which it then proves satisfy the safety specifications). But, I think sufficient fine-tuning with a GFlowNet objective will naturally penalise description complexity, and also penalise heavily biased sampling of equally complex solutions (e.g. toward ones that encode messages of any significance), and I expect this to reduce this risk to an acceptable level. I would like to fund a sleeper-agents-style experiment on this by the end of 2025.

davidad 16 May 2024 7:37 UTC
LW: 10 AF: 7
0
AF
on: Linear infra-Bayesian Bandits
Re footnote 2, and the claim that the order matters, do you have a concrete example of a homogeneous ultradistribution that is affine in one sense but not the other?

davidad 3 May 2024 5:40 UTC
7 points
−2
in reply to: ThomasCederborg’s comment on: A list of core AI safety problems and how I hope to solve them
The “random dictator” baseline should not be interpreted as allowing the random dictator to dictate everything, but rather to dictate which Pareto improvement is chosen (with the baseline for “Pareto improvement” being “no superintelligence”). Hurting heretics is not a Pareto improvement because it makes those heretics worse off than if there were no superintelligence.
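
A small Python sketch of this reading of the baseline, with made-up payoffs and hypothetical function names: the randomly chosen dictator only picks among options that are Pareto improvements over the "no superintelligence" baseline.

```python
import random


def pareto_improvements(options, baseline, utilities):
    """Options that leave nobody worse off than the baseline and someone better off."""
    improvements = []
    for option in options:
        gains = [u(option) - u(baseline) for u in utilities]
        if all(g >= 0 for g in gains) and any(g > 0 for g in gains):
            improvements.append(option)
    return improvements


def random_dictator_choice(options, baseline, utilities, rng):
    """Pick a random agent, then let that agent choose among Pareto improvements only."""
    candidates = pareto_improvements(options, baseline, utilities) or [baseline]
    dictator = rng.randrange(len(utilities))
    return max(candidates, key=utilities[dictator])


# Illustrative payoffs for (agent 0, agent 1); agent 1 is the "heretic".
PAYOFFS = {
    "no_superintelligence": (0, 0),
    "hurt_heretic": (5, -3),
    "plan_a": (3, 1),
    "plan_b": (1, 3),
}
UTILITIES = [lambda option, i=i: PAYOFFS[option][i] for i in range(2)]

choice = random_dictator_choice(
    list(PAYOFFS), "no_superintelligence", UTILITIES, random.Random(0)
)
```

Even when agent 0 is drawn as dictator and would prefer "hurt_heretic" in isolation, that option is filtered out before the dictator chooses, because it leaves agent 1 worse off than the baseline.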

davidad 5 Feb 2024 7:28 UTC
5 points
1
in reply to: martinkunev’s comment on: Davidad’s Provably Safe AI Architecture—ARIA’s Programme Thesis
Yes. You will find more details in his paper, Provably safe systems with Steve Omohundro, in which I am listed in the acknowledgments (under my legal name, David Dalrymple).

Max and I also met and discussed the similarities in advance of the AI Safety Summit in Bletchley.

davidad 12 Jan 2024 18:16 UTC
2 points
0
in reply to: Cleo Nardo’s comment on: Uncertainty in all its flavours
I agree that each of $(- + 1)$ and $(- + 2)$ has two algebraically equivalent interpretations, as you say, where one is about inconsistency and the other is about inferiority for the adversary. (I hadn’t noticed that).

The $(- + 2)$ variant still seems somewhat irregular to me; even though Diffractor does use it in Infra-Miscellanea Section 2, I wouldn’t select it as “the” infrabayesian monad. I’m also confused about which one you’re calling unbounded. It seems to me like the $(- + 2)$ variant is bounded (on both sides) whereas the $(- + 1)$ variant is bounded on one side, and neither is really unbounded. (Being bounded on at least one side is of course necessary for being consistent with infinite ethics.)

davidad 7 Jan 2024 0:33 UTC
11 points
6
in reply to: the gears to ascension’s comment on: Agent membranes and formalizing “safety”
These are very good questions. First, two general clarifications:

A. «Boundaries» are not partitions of physical space; they are partitions of a causal graphical model that is an abstraction over the concrete physical world-model.

B. To “pierce” a «boundary» is to counterfactually (with respect to the concrete physical world-model) cause the abstract model that represents the boundary to increase in prediction error (relative to the best augmented abstraction that uses the same state-space factorization but permits arbitrary causal dependencies crossing the boundary).

So, to your particular cases:
1. Probably not. There is no fundamental difference between sound and contact. Rather, the fundamental difference is between the usual flow of information through the senses and other flows of information that are possible in the concrete physical world-model but not represented in the abstraction. An interaction that pierces the membrane is one which breaks the abstraction barrier of perception. Ordinary speech acts do not. Only sounds which cause damage (internal state changes that are not well-modelled as mental states) or which otherwise exceed the “operating conditions” in the state space of the «boundary» layer (e.g. certain kinds of superstimuli) would pierce the «boundary».
2. Almost surely not. This is why, as an agenda for AI safety, it will be necessary to specify a handful of constructive goals, such as provision of clean water and sustenance and the maintenance of hospitable atmospheric conditions, in addition to the «boundary»-based safety prohibitions.
3. Definitely not. Omission of beneficial actions is not a counterfactual impact.
4. Probably. This causes prediction error because the abstraction of typical human spatial positions is that they have substantial ability to affect their position between nearby streets by simple locomotory action sequences. But if a human is already effectively imprisoned, then adding more concrete would not create additional/counterfactual prediction error.
5. Probably not. Provision of resources (that are within “operating conditions”, i.e. not “out-of-distribution”) is not a «boundary» violation as long as the human has the typical amount of control of whether to accept them.
6. Definitely not. Exploiting behavioural tendencies which are not counterfactually corrupted is not a «boundary» violation.
7. Maybe. If the ad’s effect on decision-making tendencies is well modelled by the abstraction of typical in-distribution human interactions, then using that channel does not violate the «boundary». Unprecedented superstimuli would, but the precedented patterns in advertising are already pretty bad. This is a weak point of the «boundaries» concept, in my view. We need additional criteria for avoiding psychological harm, including superpersuasion. One is simply to forbid autonomous superhuman systems from communicating to humans at all: any proposed actions which can be meaningfully interpreted by sandboxed human-level supervisory AIs as messages with nontrivial semantics could be rejected. Another approach is Mariven’s criterion for deception, but applying this criterion requires modelling human mental states as beliefs about the world (which is certainly not 100% scientifically accurate). I would like to see more work here, and more different proposed approaches.

davidad 6 Jan 2024 23:49 UTC
LW: 8 AF: 3
2
AF
on: Safety First: safety before full alignment. The deontic sufficiency hypothesis.
For the record, as this post mostly consists of quotes from me, I can hardly fail to endorse it.

davidad 3 Jan 2024 4:09 UTC
2 points
0
on: Uncertainty in all its flavours
Kosoy’s infrabayesian monad $□$ is given by $P^{+} \circ Δ \circ (- + 2)$
There are a few different varieties of infrabayesian belief-state, but I currently favour the one which is called “homogeneous ultracontributions”, which is “non-empty topologically-closed ⊥–closed convex sets of subdistributions”, thus almost exactly the same as Mio-Sarkis-Vignudelli’s “non-empty finitely-generated ⊥–closed convex sets of subdistributions monad” (Definition 36 of this paper), with the difference being essentially that it’s presentable, but it’s much more like $P_{f}^{+} \circ Δ \circ (- + 1)$ than $P_{f}^{+} \circ Δ \circ (- + 2)$.
I am not at all convinced by the interpretation of $(- + 2)$ here as terminating a game with a reward for the adversary or the agent. My interpretation of the distinguished element $⊥$ in $(- + 1)$ is not that it represents a special state in which the game is over, but rather a special state in which there is a contradiction between some of one’s assumptions/observations. This is very useful for modelling Bayesian updates (Evidential Decision Theory via Partial Markov Categories, sections 3.5-3.6), in which some variable $X$ is observed to satisfy a certain predicate $q$ : this can be modelled by applying the predicate in the form $q : X \to □ {*}$ where $q (x) = ⊥$ means the predicate is false, and $q (x) = *$ means it is true. But I don’t think there is a dual to logical inconsistency, other than the full set of all possible subdistributions on the state space. It is certainly not the same type of “failure” as losing a game.
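
The reading of $⊥$ as contradiction can be sketched in Python for the simplest case, a single subdistribution over a finite set rather than a convex set of them. This is an assumed simplification with hypothetical names, not the infrabayesian monad itself: mass sent to $⊥$ is represented implicitly as the mass missing from the subdistribution.

```python
from fractions import Fraction


def observe(dist, q):
    """Apply predicate q: keep mass where q holds; the rest is sent to ⊥."""
    return {x: p for x, p in dist.items() if q(x)}


def bottom_mass(subdist):
    """Mass on ⊥, i.e. the probability that the assumptions were contradicted."""
    return 1 - sum(subdist.values())


def normalise(subdist):
    """Renormalize the surviving mass to recover the Bayesian posterior."""
    total = sum(subdist.values())
    if total == 0:
        raise ValueError("the observation contradicts every possibility")
    return {x: p / total for x, p in subdist.items()}


# A fair die, updated on the observation that the face is even.
prior = {face: Fraction(1, 6) for face in range(1, 7)}
evens = observe(prior, lambda face: face % 2 == 0)
posterior = normalise(evens)
```

Here half of the prior mass lands on $⊥$, and renormalizing the rest gives the usual posterior of one third per even face. Nothing in this picture resembles a game ending with a reward for either side.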

davidad 3 Jan 2024 3:28 UTC
2 points
0
on: Uncertainty in all its flavours
Does this article have any practical significance, or is it all just abstract nonsense? How does this help us solve the Big Problem? To be perfectly frank, I have no idea. Timelines are probably too short agent foundations, and this article is maybe agent foundations foundations...
I do think this is highly practically relevant, not least because using an infrabayesian monad instead of the distribution monad can provide the necessary kind of epistemic conservatism for practical safety verification in complex cyber-physical systems like the biosphere being protected and the cybersphere being monitored. It also helps remove instrumentally convergent perverse incentives to control everything.

davidad 3 Jan 2024 2:50 UTC
2 points
0
on: Uncertainty in all its flavours
Meyer’s
If this is David Jaz Myers, it should be “Myers’ thesis”, here and elsewhere

davidad 4 Nov 2023 2:42 UTC
12 points
5
in reply to: TsviBT’s comment on: Does davidad’s uploading moonshot work?
I have said many times that uploads created by any process I know of so far would probably be unable to learn or form memories. (I think it didn’t come up in this particular dialogue, but in the unanswered questions section Jacob mentions having heard me say it in the past.)

Eliezer has also said that makes it useless in terms of decreasing x-risk. I don’t have a strong inside view on this question one way or the other. I do think if Factored Cognition is true then “that subset of thinking is enough,” but I have a lot of uncertainty about whether Factored Cognition is true.

Anyway, even if that subset of thinking is enough, and even if we could simulate all the true mechanisms of plasticity, then I still don’t think this saves the world, personally, which is part of why I am not in fact pursuing uploading these days.

davidad 14 Oct 2023 18:32 UTC
LW: 28 AF: 11
12
AF
on: RSPs are pauses done right
I think AI Safety Levels are a good idea, but evals-based classification needs to be complemented by compute thresholds to mitigate the risks of loss of control via deceptive alignment. Here is a non-nebulous proposal.