Daniel C

Karma: 112

Master’s student in applied mathematics, funded by Center on Long-Term Risk to investigate the cheating problem in safe pareto-improvements. Agent foundations fellow with @Alex_Altair.

Some other areas I’m interested in:

Investigate properties of general purpose search so that we can handcraft it & simply retarget the search
Investigate the type signature of world models to find properties that remain invariant under ontology shifts
Natural latents
- How to characterize natural latents in settings like PDEs?
- Equivalence of natural latents under transformation of variables
Formalizing automated design
Information theoretic impact measures
Scalable blockchain consensus mechanisms
Programming language for concurrency
Quantifying optimization power without assuming a particular utility function
What mathematical axioms would emerge in a solomonoff inductor?
How things like riemannian metric & differential equations might emerge from discrete systems

Daniel C Mar 15, 2025, 1:23 AM
1 point
0
on: Report & retrospective on the Dovetail fellowship
<3!

Daniel C Mar 7, 2025, 1:04 AM
1 point
0
on: The optimizer won’t just guess your intended semantics
Great post! Agree with the points raised but would like to add that restricting the expressivity isn’t the only way that we can try to make the world model more interpretable by design. There are many ways that we can decompose a world model into components, and human concepts correspond to some of the components (under a particular decomposition) as opposed to the world model as a whole. We can backpropagate desiderata about ontology identification to the way that the world model is decomposed.
For instance, suppose that we’re trying to identify the concept of a strawberry inside a solomonoff inductor: We know that once we identify the concept of a strawberry inside a solomonoff inductor, it needs to continue to work even when the solomonoff inductor updates to new potential hypotheses about the world (e.g. we want the concept of a strawberry to still be there even when the inductor learns about QFT). This means that we’re looking for redundant information that is present in a wide variety of potential likely hypothesis given our observations, so instead of working with all the individual TMs, we can try to capture the redundant information shared across a wide variety of TMs consistent with our existing observations (& we expect the concept of a strawberry to be part of that redundant information, as opposed to the information specific to any particular hypothesis)
This obviously doesn’t get us all the way there but I think it’s an existence proof for cutting down the search space for “human-like concepts” without sacrificing the expressivity of the world model, by reasoning about what parts of the world model could correspond to human-like concepts

Towards building blocks of ontologies

Daniel C, Alex_Altair, Dalcy, Alfred Harwood and JoseFaustino

Feb 8, 2025, 4:03 PM

29 points

0 comments26 min readLW link

Daniel C Jan 25, 2025, 3:11 AM
5 points
0
on: Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals
I think one pattern which needs to hold in the environment in order for subgoal corrigibility to make sense is that the world is modular, but that modularity structure can be broken or changed

For one, modularity is the main thing that enables general purpose search: If we can optimize for a goal by just optimizing for a few instrumental subgoals while ignoring the influence of pretty much everything else, then that reflects some degree of modularity in the problem space
Secondly, if the modularity structure of the environment stays constant no matter what (e.g We can represent it as a fixed causal DAG), then there would be no need to “respect modularity” because any action we take would preserve the modularity of the environment by default (given our assumption); we would only need to worry about side effects if there’s at least a possibility for those side effects to break or change the modularity of the problem space, and that means the modularity structure of the problem space is a thing that can be broken or changed

Example of modularity structure of the environment changing: Most objects in the world pretty much only have direct influence on other objects nearby, and we can break or change that modularity structure by moving objects to different positions. In particular, the positions are the variables which determines the modularity of “which objects influence which other objects”, and the way that we “break” the modularity structure between the objects is by intervening on those variables.
So we know that “subgoal corrigibility” requries the environment to be modular, but that modularity structure can be broken or changed. If this is true, then the modularity structure of the environment can be tracked by a set of “second-order” variables such as position which tells us “what things influence what other things” (In particular, these second-order variables themselves might satisfy some sort of modularity structure that can be changed, and we may have third-order variables that tracks the modularity structure of the second-order variables). The way that we “respect the modularity” of other instrumental subgoals is by preserving these second-order variables that track the modularity structure of the problem space.
For instance, we get to break down the goal of baking a cake into instrumental subgoals such as acquiring coca powder (while ignoring most other things) if and only if a particular modularity structure of the problem space holds (e.g. other equipments are all in the right place & right positions), and there is a set of variables that track that modularity structure (the conditions & positions of the equipments). The way we preserve that modularity structure is by preserving those variables (the conditions & positions of the equipments).
Given this, we might want to model the world in a way that explicitly represents variables that track the modularity of other variables, so that we get to preserve influence over those variables (and therefore the modularity structure that GPS relies on)

Daniel C Oct 10, 2024, 7:32 PM
3 points
0
on: Values Are Real Like Harry Potter
Imagine that I’m watching the video of the squirgle, and suddenly the left half of the TV blue-screens. Then I’d probably think “ah, something messed up the TV, so it’s no longer showing me the squirgle” as opposed to “ah, half the squirgle just turned into a big blue square”. I know that big square chunks turning a solid color is a typical way for TVs to break, which largely explains away the observation; I think it much more likely that the blue half-screen came from some failure of the TV rather than an unprecedented behavior of the squirgle.
My mental model of this is something like: My concept of a squirgle is a function $f (x)$ which maps latent variables $x$ to observations such that likelier observations correspond to latent variables with lower description length.
Suppose that we currently settle on a particular latent variable $x_{0}$ , but we receive new observations that are incompatible with $f (x_{0})$ , and these new observations can be most easily accounted for by modifying $x_{0}$ to a new latent variable $x_{1}$ that’s pretty close to $x_{0}$ , then we say that this change is still about squirgle
But if we receive new observations that can be more easily accounted for by perturbing a different latent variable $y$ that corresponds to another concept $g (y)$ (eg about TV), then that is a change about a different thing and not the squirgle
The main property that enables this kind of separation is modularity of the world model, because when most components are independent of most other components at any given time, only a change in a few latent variables (as opposed to most latent variables) is required to accomodate new beliefs, & that allows us to attribute changes in beliefs into changes about disentangled concepts

Daniel C Sep 19, 2024, 4:51 PM
1 point
0
in reply to: Lucius Bushnaq’s comment on: Lucius Bushnaq’s Shortform
Noted, that does seem a lot more tractable than using natural latents to pin down details of CEV by itself

Daniel C Sep 19, 2024, 11:52 AM
1 point
0
in reply to: Lucius Bushnaq’s comment on: Lucius Bushnaq’s Shortform
natural latents are about whether the AI’s cognition routes through the same concepts that humans use.
We can imagine the AI maintaining predictive accuracy about humans without using the same human concepts. For example, it can use low-level physics to simulate the environment, which would be predictively accurate, but that cognition doesn’t make use of the concept “strawberry” (in principle, we can still “single out” the concept of “strawberry” within it, but that information comes mostly from us, not from the physics simulation)

Natural latents are equivalent up to isomorphism (ie two latent variables are equivalent iff they give the same conditional probabilities on observables), but for reflective aspects of human cognition, it’s unclear whether that equivalence class pin down all information we care about for CEV (there may be differences within the equivalence class that we care about), in a way that generalizes far out of distribution

Daniel C Sep 18, 2024, 4:31 PM
1 point
0
in reply to: Lucius Bushnaq’s comment on: Lucius Bushnaq’s Shortform
I think the fact that natural latents are much lower dimensional than all of physics makes it suitable for specifying the pointer to CEV as an equivalence class over physical processes (many quantum field configurations can correspond to the same human, and we want to ignore differences within that equivalence class).
IMO the main bottleneck is to account for the reflective aspects in CEV, because one constraint of natural latents is that it should be redundantly represented in the environment.

[Question] Can subjunctive dependence emerge from a simplicity prior?

Daniel CSep 16, 2024, 12:39 PM

11 points

0 comments1 min readLW link

Daniel C Sep 11, 2024, 8:38 PM
1 point
0
in reply to: Noosphere89’s comment on: Jonothan Gorard:The territory is isomorphic to an equivalence class of its maps
like infinite state Turing machines, or something like this:
https://arxiv.org/abs/1806.08747
Interesting, I’ll check it out!

Daniel C Sep 11, 2024, 3:06 PM
3 points
2
in reply to: Noosphere89’s comment on: My decomposition of the alignment problem
Then we’ve converged almost completely, thanks for the conversation.
Thanks! I enjoyed the conversation too.
So you’re saying that conditional on GPS working, both capabilities and inner alignment problems are solved or solvable, right?
yes, I think inner alignment is basically solved conditional on GPS working, for capabilities I think we still need some properties of the world model in addition to GPS.
While I agree that formal proof is probably the case with the largest divide in practice, the verification/generation gap applies to a whole lot of informal fields as well, like research, engineering of buildings and bridges, and more,
I agree though if we had a reliable way to do cross the formal-informal bridge, it would be very helpful, I was just making a point about how pervasive the verification/generation gap is.
Agreed.
My main thoughts on infrabayesianism is that while it definitely interesting, and I do like quite a bit of the math and results, right now the monotonicity principle is a big reason why I’m not that comfortable with using infrabayesianism, even if it actually worked.
I also don’t believe it’s necessary for alignment/uncertainty either.
yes, the monotonicity principle is also the biggest flaw of infrabayesianism IMO, & I also don’t think it’s necessary for alignment (though I think some of their results or analogies of their results would show up in a full solution to alignment).
I wasn’t totally thinking of simulated reflection, but rather automated interpretability/alignment research.
I intended “simulated reflection” to encompass (a form of) automated interpretability/alignment research, but I should probably use a better terminology.
Yeah, a big thing I admit to assuming is that I’m assuming that the GPS is quite aimable by default, due to no adversarial cognition, at least for alignment purposes, but I want to see your solution first, because I still think this research could well be useful.
Thanks!

Daniel C Sep 11, 2024, 12:06 PM
3 points
2
in reply to: Noosphere89’s comment on: My decomposition of the alignment problem
This is what I was trying to say, that the tradeoff is in certain applications like automating AI interpretability/alignment research is not that harsh, and I was saying that a lot of the methods that make personal intent/instruction following AGIs feasible allow you to extract optimization that is hard and safe enough to use iterative methods to solve the problem.
Agreed
People at OpenAI are absolutely trying to integrate search into LLMs, see this example where they got the Q* algorithm that aced a math test:
https://www.lesswrong.com/posts/JnM3EHegiBePeKkLc/possible-openai-s-q-breakthrough-and-deepmind-s-alphago-type
Also, I don’t buy that it was refuted, based on this, which sounds like a refutation but isn’t actually a refutation, and they never directly deny it:
https://www.lesswrong.com/posts/JnM3EHegiBePeKkLc/possible-openai-s-q-breakthrough-and-deepmind-s-alphago-type#ECyqFKTFSLhDAor7k
Interesting, I do expect GPS to be the main bottleneck for both capabilities and inner alignment
it’s generally much easier to verify that something has been done correctly than actually executing the plan yourself
Agreed, but I think the main bottleneck is crossing the formal-informal bridge, so it’s much harder to come up with a specification $X$ such that $X ⟹ a l i g n m e n t$ but once we have such a specification it’ll be much easier to come up with an implementation (likely with the help of AI)
2.) Reward modelling is much simpler with respect to uncertainty, at least if you want to be conservative. If you are uncertain about the reward of something, you can just assume it will be bad and generally you will do fine. This reward conservatism is often not optimal for agents who have to navigate an explore/exploit tradeoff but seems very sensible for alignment of an AGI where we really do not want to ‘explore’ too far in value space. Uncertainty for ‘capabilities’ is significantly more problematic since you have to be able to explore and guard against uncertainty in precisely the right way to actually optimize a stochastic world towards a specific desired point.
Yes, I think optimizing worst-case performance is one crucial part of alignment, it’s also one
advantage of infrabayesianism
I do think this means we will definitely have to get better at interpretability, but the big reason I think this matters less than you think is probably due to being more optimistic about the meta-plan for alignment research, due to both my models of how research progress works, plus believing that you can actually get superhuman performance at stuff like AI interpretability research and still have instruction following AGIs/ASIs.
Yes, I agree that accelerated/simulated reflection is a key hope for us to interpret an alien ontology, especially if we can achieve something like HRH that helps us figure out how to improve automated interpretability itself. I think this would become safer & more feasible if we have an aimable GPS and a modular world model that supports counterfactual queries (as we’d get to control the optimization target for automating interpretability without worrying about unintended optimization).

Daniel C Sep 11, 2024, 10:55 AM
1 point
0
in reply to: Noosphere89’s comment on: What program structures enable efficient induction?
Thanks! :)

Daniel C Sep 10, 2024, 11:23 PM
1 point
0
in reply to: Noosphere89’s comment on: My decomposition of the alignment problem
Re the issue of Goodhart failures, maybe a kind of crux is how much do we expect General Purpose Search to be aimable by humans
I also expect general purpose search to be aimable, in fact, it’s selected to be aimable so that the AI can recursively retarget GPS on instrumental subgoals
which means we can put a lot of optimization pressure, because I view future AI as likely to be quite correctable, even in the superhuman regime.
I think there’s a fundamental tradeoff between optimization pressure & correctability, because if we apply a lot of optimization pressure on the wrong goals, the AI will prevent us from correcting it, and if the goals are adequate we won’t need to correct them. Obviously we should lean towards correctability when they’re in conflict, and I agree that the amount of optimization pressure that we can safely apply while retaining sufficient correctability can still be quite high (possibly superhuman)
Another crux might be that I think alignment probably generalizes further than capabilities, for the reasons sketched out by Beren Millidge here:
Yes, I consider this to be the central crux.
I think current models lack certain features which prevent the generalization of their capabilities, so observing that alignment generalizes further than capabilities for current models is only weak evidence that it will continue to be true for agentic AIs
I also think an adequate optimization target about the physical world is much more complex than a reward model for LLM, especially because we have to evaluate consequences in an alien ontology that might be constantly changing

Daniel C Sep 10, 2024, 11:21 PM
1 point
0
in reply to: Noosphere89’s comment on: My decomposition of the alignment problem
We can also get more optimization if we have better tools to aim General Purpose Search more so that we can correct the model if it goes wrong.
Yes, I think having an aimable general purpose search module is the most important bottleneck for solving inner alignment
I think things can still go wrong if we apply too much optimization pressure to an inadequate optimization target because we won’t have a chance to correct the AI if it doesn’t want us to (I think adding corrigibility is a form of reducing optimization pressure, but it’s still desirable).

Daniel C Sep 10, 2024, 8:02 PM
1 point
0
in reply to: Seth Herd’s comment on: My decomposition of the alignment problem
Good point. I agree that the wrong model of user’s preferences is my main concern and most alignment thinkers’. And that it can happen with a personal intent alignment as well as value alignment.
This is why I prefer instruction-following to corrigibility as a target. If it’s aligned to follow instructions, it doesn’t need nearly as much of a model of the user’s preferences to succeed. It just needs to be instructed to talk through its important actions before executing, like “Okay, I’ve got an approach that should work. I’ll engineer a gene drive to painlessly eliminate the human population”. “Um okay, I actually wanted the humans to survive and flourish while solving cancer, so let’s try another approach that accomplishes that too...”. I describe this as do-what-I-mean-and-check, DWIMAC.
Yes, I also think that is a consideration in favor of instruction following. I think there’s an element of IF which I find appealing, it’s somewhat similar to bayesian updating: When I tell an IF agent to “fill the cup”, on one hand it will try to fulfill that goal, but it also thinks about the “usual situation” where that instruction is satisfied, & it will notice that the rest of the world remains pretty much unchanged, so it will try to replicate that. We can think of the IF agent as having a background prior over world states, and it conditions that prior on our instructions to get a posterior distribution over world states, & that’s the “target distribution” that it’s optimizing for. So it will try to fill the cup, but it wouldn’t build a dyson sphere to harness energy & maximize the probability of the cup being filled, because that scenario has never occurred when a cup has been filled (so that world has low prior probability).
I think this property can also be transferred to PIA and VA, where we have a compromise between “desirable worlds according to model of user values” and “usual worlds”.
Also, accurately modeling short-term intent—what the user wants right now—seems a lot more straightforward than modeling the deep long-term values of all of humanity. Of course, it’s also not as good a way to get a future that everyone likes a lot. This seems like a notable difference but not an immense one; the focus on instructions seems more important to me.
Absent all of that, it seems like there’s still two advantages to modeling just one person’s values instead of all of humanity’s. The smaller one is that you don’t need to understand as many people or figure out how to aggregate values that conflict with each other. I think that’s not actually that hard since lots of compromises could give very good futures, but I haven’t thought that one alal the way through. The bigger advantage is that one person can say “oh my god don’t do that it’s the last thing I want” and it’s pretty good evidence for their true values. Humanity as a whole probably won’t be in a position to say that before a value-aligned AGI sets out to fulfill its (misgeneralized) model of their values.
Agreed, I also favor personal intent alignment for those reasons, or at least I consider PIA + accelerated & simulated reflection to be the most promising path towards eventual VA
Doesn’t easier to build mean lower alignment tax?
It’s part of it, but alignment tax also includes the amount of capabilities that we have to sacrifice to ensure that the AI is safe. The way I think of alignment tax is that for every optimization target, there is an upper bound on the optimization pressure that we can apply before we run into goodhart failures. The closer the optimization target is to our actual values, the more optimization pressure we get to safely apply. & because each instruction only captures a small part of our actual values, we have to limit the amount of optimization pressure we apply (this is also why we need to avoid side effects when the AI has an imperfect model of the users’ preferences).

Daniel C Sep 10, 2024, 7:59 PM
1 point
0
in reply to: Noosphere89’s comment on: My decomposition of the alignment problem
Yes, I think synthetic data could be useful for improving the world model. It’s arguable that allowing humans to select/filter synthetic data for training counts as a form of active learning, because the AI is gaining information about human preference through its own actions (generating synthetic data for humans to choose). If we have some way of representing uncertainties over human values, we can let our AI argmax over synthetic data with the objective of maximizing information gain about human values (when synthetic data is filtered).
I think using synthetic data for corrigibility can be more or less effective depending on your views on corrigibility and the type of AI we’re considering. For instance, it would be more effective under Christiano’s act-based corrigibility because we’re avoiding any unintended optimization by evaluating the agent at the behavioral level (sometimes even at thought level), but in this paradigm we’re basically explicitly avoiding general purpose search, so I expect a much higher alignment tax.
If we’re considering an agentic AI with a general purpose search module, misspecification of values is much more susceptible to goodhart failures because we’re applying much more optimization pressure, & it’s less likely that synthetic data on corrigibility can offer us sufficient robustness, especially when there may be systematic bias in human filtering of synthetic data. So in this context I think a value-free core of corrigibility would be necessary to avoid the side effects that we can’t even think of.

Daniel C Sep 10, 2024, 1:45 PM
5 points
0
in reply to: Noosphere89’s comment on: My decomposition of the alignment problem
Interesting! I have to read the papers in more depth but here are some of my initial reactions to that idea (let me know if it’s been addressed already):
- AFAICT using learning to replace GPS either requires:1) Training examples of good actions or 2) An environment like chess where we can rapidly gain feedback through simulation. Sampling from the environment would be much more costly when these assumptions break down, and general purpose search can enable lower sample complexity because we get to use all the information in the world model
- General purpose search requires certain properties of the world model that seem to be missing in current models. For instance, decomposing goals into subgoals is important for dealing with a high-dimensional action space, and that requires a high degree of modularity in the world model. Lazy world-modeling also seems important for planning in a world larger than yourself, but most of these properties aren’t present in the toy environments we use
- Learning can be a component of general purpose search (eg as a general purpose generator of heuristic), where we can learn to rearrange the search ordering of actions so that more effective actions are searched first
- I think using a fixed number of forward-passes to approximate GPS will eventually face limitations in environments that are complexed enough, because the space of programs which can dedicate potentially unlimited time to find solutions is strictly more expressive than the space of programs that has a fixed inference time
What links here?
- Noosphere89's comment on My decomposition of the alignment problem by Daniel C (Sep 11, 2024, 12:46 AM; 3 points)
- Daniel C's comment on My decomposition of the alignment problem by Daniel C (Sep 10, 2024, 11:23 PM; 1 point)

Daniel C Sep 10, 2024, 1:43 PM
1 point
0
in reply to: Seth Herd’s comment on: My decomposition of the alignment problem
Good point!
But more recently I’ve been thinking that neither will be a real issue, because Instruction-following AGI is easier and more likely than value aligned AGI. The obvious solution to both alignment stability and premature/incorrect/mis-specified value lock-in is to keep a human in the loop by making AGI whose central goal is to follow instructions (or similar personal intent alignment) from authorized user(s).
I think this argument also extends to value-aligned AI, because the value-aligned AGI will keep humans in the loop insofar as we want to be kept in the loop, & it will be corrigible insofar as we want it to be corrigible.
But a particular regime I’m worried about (for both PIA & VA) is when the AI has an imperfect model of the users’ goals inside its world model & optimizes for them. One way to mitigate this is to implement some systematic ways of avoiding side effects (corrigibility/impact regularization) so that we won’t need to require a perfect optimization target, another is to allow the AI to update its goals as it improves its model of the users’ values.
Instruction-following AI can also help with this, though I think it might imply a higher alignment tax, but it’s also probably easier to build.

Daniel C Sep 8, 2024, 11:49 PM
5 points
0
in reply to: lemonhope’s comment on: What program structures enable efficient induction?
Yes, I plan to write a sequence about it some time in the future, but here are some rough high-level sketches:
- Basic assumptions: Modularity implies that the program can be broken down into loosely coupled components, for now I’ll just assume that each component has some “class definition” which specifies how it interacts with other components; “class definitions” can be reused (aka we can instantiate multiple components of the same class); each component can aggregate info from other components & the info they store can be used by other components
- Expressive modularity: A problem with modularity is that it cuts out information flow between certain components, but before we learn about the world we don’t know which components are actually independent, & the modularity of the environment might change over time, so we need to account for that.
  - As a basic framework, we can think of each component as having transformer-style attention values over other components, modularity means that we want the “attention values”(mutual info) to be as sparse as possible
  - Expressivity means that those “attention values” should be context dependent (they are functions of aggregate information from other components)
  - A consequence of this is that we can have variables that encode the modularity structure of the environment which influence the attention values(mutual info) of other variables
    One example is the eulerian vs lagrangian description of fluid flow: the eulerian description has a fixed modularity structure because each region of space has a fixed markov blanket, but the lagrangian structure has a dynamic modularity structure because “what particles are directly influenced by what other particles” depends on the positions of the particles which change over time. We want to our program be able to accomodate both types of descriptions
  - We can get the equivalent of “function calls” by having attention values over “class definitions”, so that components can instantiate computations of other components if it needs to. This is somewhat similar to the idea of lazy world-modelling
- Components that generalize over other components: Given modularity, the main way that we can augment our program to accomodate new observations is to add more components (or tweak existing components), this means that the main way to learn efficiently is to structure our current program in a way such that we can accomodate new observations with as few additional components as possible
  - Since our program is made out of components, this means we want our exsting components to adapt to new components in a generalizable way
  - Concretely, if we think of each “component” as a causal node, then each causal node $A$ should define a mapping $F_{A}$ from another causal node $X$ to the causal edge $F_{A} (X) = X \to A$ . This basically allows each causal node to “generalize” over other causal nodes so that it can use information from them in the right ways
- Closing the loop: On top of that, we can use a part of our program to encode a compressed encoding of additional components (so that components that are more likely will be higher in the search ordering). Implementing the compressed encoding itself requires additional components, so that changes the distribution of additional components, & we can augment the compressed encoding to account for that (but that introduces a further change in distribution, and so on and so on...)
- Relevance to alignment(highly speculative): Accomodating new observations by adding new components while keeping existing structures might allow us to more easily preserve a particular ontology, so that even when the AI augments it to accomodate new observations, we can still map back to the original ontology
  - I also have some intuition that some of these ideas can allow us to more naturally represent things like counterfactuals, boundaries, natural latents and general purpose search in the world model
Note: I haven’t thought of the best framing of these ideas but hopefully I’ll come back with a better presentation some point in the future

Daniel C

Towards build­ing blocks of ontologies

[Question] Can sub­junc­tive de­pen­dence emerge from a sim­plic­ity prior?

Towards building blocks of ontologies

[Question] Can subjunctive dependence emerge from a simplicity prior?