Hi all! I’m a long-time LWer, but I’m making a comment thread here so that my research fellows can introduce themselves under it!
For the past year or so I’ve been running the Dovetail research fellowship in agent foundations with @Alfred Harwood. We like to have our fellows make LW posts about what they worked on during the fellowship, and everyone needs a bit of karma to get started. Here’s a place to do that!
Hi everyone!
I’m Santiago Cifuentes, and I’ve been a Dovetail Fellow since November 2025, working on agent foundations. My current research project consists of extending previous results that aim to characterize which agents contain world models (such as https://arxiv.org/pdf/2506.01622). Along similar lines, I would like to provide a more general definition of what a world model is!
I’ve been silently lurking on LessWrong since 2023, having come across the forum while looking for rationality content (I found The Sequences quite revealing in particular). I am looking forward to contributing to the discussion!
Hi everyone!
I am Margot Stakenborg, and I have been working with Dovetail in this winter fellowship cohort. I have a background in theoretical physics and philosophy of physics, and I am now making a switch into conceptual mechinterp, after having been interested in it and learning about it for some years. With Dovetail I have been working on formalising world models: I am writing up a sequence of posts on the philosophical and mathematical prerequisites for proper world models and on which tools from physics can help us understand and analyse different world models, and I will dive into the different definitions of “world model” that float around in the mechinterp and AI safety literature. Things I will discuss are:
How is the concept “world model” used in different areas of the ML literature?
Concept representation in the brain: new frontiers from neuroscience
Tools from physics: renormalisation and coarse-graining (see the toy sketch after this list)
What are “natural features”?
When can networks find similar representations of the world as we do?
Can NNs discover new natural kinds?
Theoretical equivalence and intertheoretic reduction
Bayesian experimental design
And probably more…
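As a tiny illustration of the kind of coarse-graining move I have in mind for the physics-tools post, here is a toy block-spin example (my own sketch, not anything from the actual posts; all names are made up):

```python
import numpy as np

def coarse_grain(spins: np.ndarray, block: int = 3) -> np.ndarray:
    """Majority-vote block-spin coarse-graining: map each block x block
    patch of +/-1 spins to one coarse spin, the sign of the block sum."""
    n = (spins.shape[0] // block) * block
    trimmed = spins[:n, :n]                        # drop any ragged edge
    blocks = trimmed.reshape(n // block, block, n // block, block)
    sums = blocks.sum(axis=(1, 3))                 # sum within each patch
    return np.where(sums >= 0, 1, -1)              # ties (even blocks) go to +1

rng = np.random.default_rng(0)
fine = rng.choice([-1, 1], size=(27, 27))          # a random spin configuration
print(coarse_grain(fine).shape)                    # (9, 9): fewer degrees of freedom
```

Repeating this map and asking which descriptions of the system survive it is one way physics makes “levels of description” precise, which is part of why I think it is relevant to world models.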
I hope to build this out into quite a comprehensive and complete sequence. Do let me know if there are other questions or subjects you would be interested in reading about!
Hello! I’m Guillermo, a fellow in the Winter25 cohort.
I have a background in mathematics, computer science, and particularly computational neuroscience.
For my project I am looking at the Reward Hypothesis in decision theory and reinforcement learning theory, and I would like to write a digest of the main results that connect a preference order, via order-preserving (utility) functions, to expected utility maximization and reward functions (with discount factors). I would furthermore like to formalize some of the key results in Lean.
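As a taste of the kind of statement I would like to formalize, here is a minimal Lean 4 sketch (with Mathlib) of the very first link in that chain; the names are my own, not from any existing development:

```lean
import Mathlib.Data.Real.Basic

-- A total, transitive preference order on outcomes.
structure Pref (α : Type) where
  pref  : α → α → Prop   -- "x is at least as good as y"
  total : ∀ x y, pref x y ∨ pref y x
  trans : ∀ x y z, pref x y → pref y z → pref x z

/-- `u` represents `P` when utility comparisons agree with preferences.
The representation theorems I want to digest say when such a `u`
exists, and how it extends to expected utilities and rewards. -/
def Represents {α : Type} (P : Pref α) (u : α → ℝ) : Prop :=
  ∀ x y, P.pref x y ↔ u y ≤ u x
```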
Overall I am interested in topics that connect rationality and decision theory all the way to practical aspects of machine learning and reinforcement learning, to try to bridge these topics for AI Safety.
Nice to meet you all!
Hi everyone!
My name is Robert Adragna, and I’ve been working with Dovetail in this winter fellowship cohort on agent foundations. Specifically, I’ve been trying to better understand what background assumptions the Natural Abstractions Hypothesis (NAH) makes about the world, and whether the abstractions it posits might be learned by existing LLM systems. Questions that I’m exploring include:
Is the Platonic Representation Hypothesis from deep learning evidence for the Natural Abstractions Hypothesis?
Is it possible to construct a dataset which represents the world in a completely unbiased way?
How can Natural Abstractions be both universal & observer/goal dependent?
What would it take to empirically test the NAH?
I’ve been lurking on LessWrong since 2024, when I got interested in AI Safety, and am very excited to spend more time engaging with the community.
Hello everyone!
I am Léo Cymbalista, one of the Dovetail fellows since November 2025. I’m a physics undergraduate in the process of switching to theoretical AI safety research. My current research project is writing an explainer for computational mechanics, which is almost finished. I hope to use the knowledge I acquired while researching it to answer questions such as “given two coupled stochastic processes, when can we say that one is modeling the other?”, which could be useful for investigating the presence of world models in agents.
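For a flavour of the formalism: computational mechanics groups pasts into causal states, where two histories are equivalent exactly when they induce the same distribution over futures:

$$\overleftarrow{x} \sim_\varepsilon \overleftarrow{x}' \iff \Pr\left(\overrightarrow{X} \mid \overleftarrow{X} = \overleftarrow{x}\right) = \Pr\left(\overrightarrow{X} \mid \overleftarrow{X} = \overleftarrow{x}'\right)$$

The equivalence classes are the states of the ε-machine, the minimal optimal predictor of the process. My hope is that a similar equivalence can be set up between two coupled processes to make “one is modeling the other” precise (that last step is my own speculation, not an established result).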
I have only known about LW (and AI safety) for about a year, so I’m still not very familiar with it, but it seems very interesting so far!
Hi, I’m Vardhan, one of the Dovetail fellows this winter. Thanks Alex & Alfred for running this!
Background: I study mathematics and computer science (probability, algorithms, game theory) and I’m interested in formal models of agents and multi-agent interaction.
For the fellowship, I looked at the question: Which agents can be faithfully described by finite automata / finite transducers, and which structural properties make that more or less likely? In other words, when can an agent’s externally observable behavior be captured by a finite (possibly stochastic) automaton, and what observable signatures indicate that a finite-state model is impossible or misleading?
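To make the contrast concrete, here is the kind of toy example I have in mind (illustrative only, not taken verbatim from the report):

```python
def parity_agent(observations):
    """Finite-state: outputs the running parity of 1s seen so far.
    Two internal states suffice, so this agent is a finite transducer."""
    state = 0
    for obs in observations:
        state ^= obs                  # flip on every 1
        yield state

def counter_agent(observations):
    """Not finite-state: outputs 1 exactly when the number of 1s seen
    equals the number of 0s. Tracking the difference exactly requires
    unboundedly many states (a pumping-style argument), so no finite
    automaton reproduces this behavior on all input streams."""
    diff = 0
    for obs in observations:
        diff += 1 if obs == 1 else -1
        yield int(diff == 0)

obs = [1, 0, 0, 1, 1, 0]
print(list(parity_agent(obs)))        # [1, 1, 1, 0, 1, 1]
print(list(counter_agent(obs)))       # [0, 1, 0, 1, 0, 1]
```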
I’ve written a brief report summarizing definitions, toy examples, and some light lemmas. I’m planning a longer post with formal definitions, more examples, and proofs. I’d really appreciate recommendations on literature I may have missed (especially anything linking automata/dynamical-systems perspectives to algorithmic information theory, ergodic theory, or learning theory). Comments, questions, and pointers very welcome!