The Metaphysical Structure of Pearl’s Theory of Time
Epistemic status: metaphysics
I was reading Factored Space Models (previously, Finite Factored Sets) and was trying to understand in what sense it was a Theory of Time.
Scott Garrabrant says “[The Pearlian Theory of Time] … is the best thing to happen to our understanding of time since Einstein”. I read Pearl’s book on Causality[1], and while there’s math, the metaphysical connection that Scott seems to make isn’t really explicated. Timeless Causality and Timeless Physics are the only places I saw this view explained explicitly, but not at the level of math / language used in Pearl’s book.
Here is my attempt at explicitly writing down what all of these views are pointing at (in a more rigorous language)—the core of the Pearlian Theory of Time, and in what sense FSM shares the same structure.
Causality leaves a shadow of conditional independence relationships over the observational distribution. Here’s an explanation providing the core intuition:
Suppose you represent the ground truth structure of [causality / determination] of the world via a Structural Causal Model over some variables, a very reasonable choice. Then, as you go down the Pearlian Rung: SCM →[2] Causal Bayes Net →[3] Bayes Net, theorems guarantee that the Bayes Net is still Markovian wrt the observational distribution.
Causal Discovery then (at least in this example) reduces to inferring the equation assignment directions of the SCM, given only the observational distribution.
The earlier result guarantees that all you have to do is find a Bayes Net that is Markovian wrt the observational distribution. Alongside the faithfulness assumption, this thus reduces to finding a Bayes Net structure G whose set of independencies (implied by d-separation) are identical to that of P (or, finding the Perfect Map of a distribution[4]).
Then, at least some of the edges of the Perfect Map will have their directions nailed down by the conditional independence relations.
The metaphysical claim is that this direction is, morally, the definition of time[5], based on the intuition provided by the example above.
So, the Pearlian Theory of Time is the claim that Time is the partial order over the variables of a Bayes Net corresponding to the perfect map of a distribution.
Abstracting away, the structure of any Theory of Time is then to:
find a mathematical structure [in the Pearlian Theory of Time, a Bayes Net]
… that has gadgets [d-separation]
… that are, in some sense, “equivalent” [soundness & completeness] to the conditional independence relations of the distribution the structure is modeling
… while containing a notion of order [parenthood relationship of nodes in a Bayes Net]
… while the order induced by the gadget coincides with that of d-separation [trivially so here, because we’re talking about Bayes Nets and d-separation], such that it captures the earlier example which provided the core intuition behind our Theory of Time.
This is exactly what Factored Space Model does:
find a mathematical structure [Factored Space Model]
… that has gadgets [structural independence]
… that are, in some sense, “equivalent” [soundness & completeness] to the conditional independence relations of the distribution the structure is modeling
… while containing a notion of order [preorder relation induced by the subset relationship of the History]
… while the order induced by the gadget coincides with that of d-separation [by a theorem of FSM], such that it captures the earlier example which provided the core intuition behind our Theory of Time.
while, additionally, generalizing the scope of our Theory of Time from [variables that appear in the Bayes Net] to [any variables defined over the factored space].
… thus justifying calling FSM a Theory of Time in the same spirit that Pearlian Causal Discovery is a Theory of Time.
By (1) making a graph with edge direction corresponding to equation assignment direction, (2) pushforwarding uncertainties to endogenous variables, and (3) letting interventional distributions be defined by the truncated factorization formula.
By (1) forgetting the causal semantics, i.e. no longer associating the graph with all the interventional distributions, and only the no intervention observational distribution.
This approach goes back to Hans Reichenbach’s book The Direction of Time. I think the problem is that the set of independencies alone is not sufficient to determine a causal and temporal order. For example, the same independencies between three variables could be interpreted as the chains A→B→C and A←B←C. I think Pearl talks about this issue in the last chapter.
The critical insight is that this is not always the case!
Let’s call two graphs I-equivalent if their sets of independencies (implied by d-separation) are identical. A theorem of Bayes Nets says that two graphs are I-equivalent if and only if they have the same skeleton and the same set of immoralities.
This last constraint, plus the constraint that the graph must be acyclic, allows some arrow directions to be identified—namely, across all I-equivalent graphs that are the perfect map of a distribution, some of the edges have identical directions assigned to them.
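As a rough illustration of the skeleton-plus-immoralities criterion, here is a minimal Python sketch. The function names (`skeleton`, `immoralities`, `i_equivalent`) and the edge-set representation are my own assumptions for illustration, not anything from Pearl's book, and the check assumes both graphs are DAGs over the same node set.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of a DAG given as a set of (parent, child) pairs."""
    return {frozenset(edge) for edge in edges}

def immoralities(edges):
    """Colliders a -> c <- b whose two parents a, b are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, pas in parents.items():
        for a, b in combinations(sorted(pas), 2):
            if frozenset((a, b)) not in skel:
                found.add((frozenset((a, b)), child))
    return found

def i_equivalent(g1, g2):
    """Same skeleton and same immoralities (assumes a shared node set)."""
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

# Hypothetical three-variable examples.
chain_forward = {("A", "B"), ("B", "C")}   # A -> B -> C
chain_backward = {("C", "B"), ("B", "A")}  # A <- B <- C
fork = {("B", "A"), ("B", "C")}            # A <- B -> C
collider = {("A", "B"), ("C", "B")}        # A -> B <- C

print(i_equivalent(chain_forward, chain_backward))
print(i_equivalent(chain_forward, fork))
print(i_equivalent(chain_forward, collider))
```

The two chains and the fork share a skeleton and have no immoralities, so independencies alone cannot tell them apart. The collider has the immorality at B, which is what lets its arrow directions be identified.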
The IC algorithm (Verma & Pearl, 1990) for finding perfect maps (hence temporal direction) is exactly about exploiting these conditions to orient as many of the edges as possible:
More intuitively, (Verma & Pearl, 1992) and (Meek, 1995) together show that the following four rules are necessary and sufficient operations to maximally orient the graph according to the I-equivalence (+ acyclicity) constraint:
Anyone interested in further detail should consult Pearl’s Causality Ch 2. Note that for some reason Ch 2 is the only chapter in the book where Pearl talks about Causal Discovery (i.e. inferring time from observational distribution) and the rest of the book is all about Causal Inference (i.e. inferring causal effect from (partially) known causal structure).
Ah yes, the fork asymmetry. I think Pearl believes that correlations reduce to causations, so this is probably why he wouldn’t particularly try to, conversely, reduce causal structure to a set of (in)dependencies. I’m not sure whether the latter reduction is ultimately possible in the universe. Are the correlations present in the universe, e.g. defined via the Albert/Loewer Mentaculus probability distribution, sufficient to recover the familiar causal structure of the universe?
Thoughtdump on why I’m interested in computational mechanics:
one concrete application to natural abstractions from here: tl;dr, belief structures generally seem to be fractal shaped. one major part of natural abstractions is trying to find the correspondence between structures in the environment and concepts used by the mind. so if we can do the inverse of what adam and paul did, i.e. ‘discover’ fractal structures from activations and figure out what stochastic process they might correspond to in the environment, that would be cool
… but i was initially interested in reading compmech stuff not with a particular alignment relevant thread in mind but rather because it seemed broadly similar in directions to natural abstractions.
re: how my focus would differ from my impression of current compmech work done in academia: academia seems faaaaaar less focused on actually trying out epsilon reconstruction in real world noisy data. CSSR is an example of a reconstruction algorithm. apparently people did compmech stuff on real-world data, don’t know how good, but far less effort has been invested there compared to theory work
would be interested in these reconstruction algorithms, eg what are the bottlenecks to scaling them up, etc.
tangent: epsilon transducers seem cool. if the reconstruction algorithm is good, a prototypical example i’m thinking of is something like: pick some input-output region within a model, and literally try to discover the hmm model reconstructing it? of course it’s gonna be unwieldy large. but, to shift the thread in the direction of bright-eyed theorizing …
the foundational Calculi of Emergence paper talked about the possibility of hierarchical epsilon machines, where you do epsilon machines on top of epsilon machines and for simple examples where you can analytically do this, you get wild things like coming up with more and more compact representations of stochastic processes (eg data stream → tree → markov model → stack automata → … ?)
this … sounds like natural abstractions in its wildest dreams? literally point at some raw datastream and automatically build hierarchical abstractions that get more compact as you go up
haha but alas, (almost) no development afaik since the original paper. seems cool
and also more tangentially, compmech seemed to have a lot to talk about providing interesting semantics to various information measures aka True Names, so another angle i was interested in was to learn about them.
eg crutchfield talks a lot about developing a right notion of information flow—obvious usefulness in eg formalizing boundaries?
many other information measures from compmech with suggestive semantics—cryptic order? gauge information? synchronization order? check ruro1 and ruro2 for more.
Epsilon machine (and MSP) construction is most likely computationally intractable [I don’t know an exact statement of such a result in the literature but I suspect it is true] for realistic scenarios.
Scaling an approximate version of epsilon reconstruction therefore seems of prime importance. Real world architectures and data have highly specific structure & symmetry that make them different from completely generic HMMs. This must most likely be exploited.
The calculi of emergence paper has inspired many people but has not been developed much. Many of the details are somewhat obscure and vague. I also believe that most likely completely different methods are needed to push the program further. Computational Mechanics is primarily a theory of hidden Markov models; it doesn’t have the tools to easily describe behaviour higher up the Chomsky hierarchy. I suspect more powerful and sophisticated algebraic, logical and categorical thinking will be needed here. I caveat this by saying that Paul Riechers has pointed out that actually one can understand all these gadgets up the Chomsky hierarchy as infinite HMMs which may be analyzed usefully just as finite HMMs.
The still-underdeveloped theory of epsilon transducers I regard as the most promising lens on agent foundations. This is uncharted territory; I suspect the largest impact of computational mechanics will come from this direction.
Your point on True Names is well-taken. More basic examples than gauge information and synchronization order are the triple of quantities entropy rate h, excess entropy E and Crutchfield’s statistical/forecasting complexity C. These are the most important quantities to understand for any stochastic process (such as the structure of language and LLMs!)
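To make the triple (h, E, C) concrete, here is a small Python sketch on the Golden Mean process (a binary process in which a 0 is never followed by another 0). The choice of process and all names below are my own illustrative assumptions. The shortcuts used (C = H[π] and E = H[X] − h) rely on this process being a first-order Markov chain over symbols whose causal states coincide with the last symbol; they are not general formulas.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Golden Mean process as a Markov chain over symbols:
# after a 0 the next symbol is always 1; after a 1 it is 0 or 1 with equal odds.
transition = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.5, 1: 0.5}}

def stationary(transition, iterations=200):
    """Approximate the stationary distribution by power iteration."""
    dist = {s: 1.0 / len(transition) for s in transition}
    for _ in range(iterations):
        nxt = {s: 0.0 for s in transition}
        for s, row in transition.items():
            for t, p in row.items():
                nxt[t] += dist[s] * p
        dist = nxt
    return dist

pi = stationary(transition)

# Entropy rate: average uncertainty of the next symbol given the current state.
h = sum(pi[s] * entropy(transition[s].values()) for s in transition)

# Statistical complexity: entropy of the stationary distribution over causal
# states (here identified with the last symbol, an assumption of this example).
C = entropy(pi.values())

# Excess entropy: for a first-order Markov chain, I(X_t; X_{t+1}) = H[X] - h.
E = entropy(pi.values()) - h

print(f"h = {h:.4f} bits/symbol, C = {C:.4f} bits, E = {E:.4f} bits")
```

For processes with longer memory or hidden states that are not a function of the last symbol, C and E would have to be computed from the epsilon machine itself rather than with these shortcuts.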
Typical examples of selection theorems in my mind are: coherence theorems, good regulator theorem, causal good regulator theorem.
Coherence theorem: Given an agent satisfying some axioms, we can observe their behavior in various conditions and construct U, and then the agent’s behavior is equivalent to a system that is maximizing U.
Says nothing about whether the agent internally constructs U and uses it.
(Little Less Silly version of the) Good regulator theorem: A regulator R that minimizes the entropy of a system variable S (where there is an environment variable X upstream of both R and S) without unnecessary noise (hence deterministic) is behaviorally equivalent to a deterministic function of S (despite being a function of X).
Says nothing about whether R actually internally reconstructs S and uses it to produce its output.
Causal good regulator theorem (summary): Given an agent achieving low regret across various environment perturbations, we can observe their behavior in specific perturbed-environments, and construct G′ that is very similar to the true environment G. Then argue: “hence the agent must have something internally isomorphic to G”. Which is true, but …
says nothing about whether the agent actually uses those internal isomorphic-to-G structures in the causal history of computing its output.
And I got stuck here wondering, man, how do I ever prove anything structural.
Then I considered some theorems that, if you squint really really hard, could also be framed in the selection theorem language in a very broad sense:
SLT: Systems selected to get low loss are likely to be in a degenerate part of the loss landscape.[1]
Says something about structure: by assuming the system to be a parameterized statistical model, it says the parameters satisfy certain conditions like degeneracy (which further implies e.g., modularity).
This made me realize that to prove selection theorems on structural properties of agents, you should obviously give more mathematical structure to the “agent” in the first place:
SLT represents a system as a parameterized function—very rich!
In coherence theorem, the agent is just a single node that outputs decision given lotteries. In the good regulator theorem and the causal good regulator theorem, the agent is literally just a single node in a Bayes Net—very impoverished!
And recall, we actually have an agent foundations style selection theorem that does prove something structural about agent internals by giving more mathematical structure to the agent:
Gooder regulator theorem: A regulator is now two nodes instead of one, but the latter-in-time node gets an additional information about the choice of “game” it is being played against (thus the former node acts as a sort of information bottleneck). Then, given that the regulator makes S take minimum entropy, the first node must be isomorphic to the likelihood function s↦P(S=s|X).
This does say something about structure, namely that an agent (satisfying certain conditions) with an internal information bottleneck (structural assumption) must have that bottleneck be behaviorally equivalent to a likelihood function, whose output is then connected to the second node. Thus it is valid to claim that (under our structural assumption) the agent internally reconstructs the likelihood values and uses it in its computation of the output.
So in short, we need more initial structure or even assumptions on our “agent,” at least more so than literally a single node in a Bayes Net, to expect to be able to prove something structural.
Similar setup to the Causal good regulator theorem, but instead of a single node representing an agent’s decision node, assume that the agent as a whole is represented by an unknown causal graph G, with a number of nodes designated as input and output, connected to the rest-of-the-world causal graph E. Then claim: Agents with low regret must have G that admits an abstracting causal model map (summary) from E, and (maybe more structural properties such as) the approximation error should roughly be lowest around the input/output & utility nodes, and increase as you move further away from it in the low-level graph. This would be a very structural claim!
I’m being very very [imprecise/almost misleading] here—because I’m just trying to make a high-level point and the details don’t matter too much—one of the caveats (among many) being that this statement makes the theoretically yet unjustified connection between SGD and Bayes.
Yeah, I think structural selection theorems matter a lot, for reasons I discussed here.
This is also one reason why I continue to be excited about Algorithmic Information Theory. Computable functions are behavioral, but programs (= algorithms) are structural! The fact that programs can be expressed in the homogeneous language of finite binary strings gives a clear way to select for structure; just limit the length of your program. We even know exactly how this mathematical parameter translates into real-world systems, because we can know exactly how many bits our ML models take up on the hard drives.
And I think you can use algorithmic information distance to well-define just how close to agent-structured your policy is. First, define the specific program A that you mean to be maximally agent-structured (which I define as a utility-maximizing program). If your policy (as a program) can be described as “Program A, but different in ways X” then we have an upper bound for how close it is to agent-structured. X will be a program that tells you how to transform A into your policy, and that gives us a “distance” of at most the length of X in bits.
For a given length, almost no programs act anything like A. So if your policy is only slightly bigger than A, and it acts like A, then it’s probably of the form “A, but slightly different”, which means it’s agent-structured. (Unfortunately this argument needs like 200 pages of clarification.)
It’s maybe also worth saying that any other description method is a subset of programs (or is incomputable and therefore not what real-world AI systems are). So if the theoretical issues in AIT bother you, you can probably make a similar argument using a programming language with no while loop, or I dunno, finite MDPs whose probability distributions are Gaussian with finite parameter descriptions.
[Some thoughts that are similar but different to my previous comment;]
I suspect you can often just prove the behavioral selection theorem and structural selection theorem in separate, almost independent steps.
Prove a behavioral theorem
add in a structural assumption
prove that behavioral result plus structural assumption implies structural result.
Behavior essentially serves as an “interface”, and a given behavior can be implemented by any number of different structures. So it would make sense that you need to prove something about structure separately (and that you can prove it for multiple different types of structural assumption).
Further claims: for any given structural class,
there will be a natural simplicity measure
simpler instances will be exponentially rare.
A structural class is something like programs, or Markov chains, or structural causal models. The point of specifying structure is to in some way model how the system might actually be shaped in real life. So it seems to me that any of these will be specified with a finite string over a finite alphabet. This comes with the natural simplicity measure of the length of the specification string, and there are exponentially fewer short strings than long ones.[1]
So let’s say you want to prove that your thing X which has behavior B has specific structure S. Since structure S has a fixed description length, you almost automatically know that it’s exponentially less likely for X to be one of the infinitely many structures with description length longer than S. (Something similar holds for being within delta of S) The remaining issue is whether there are any other secret structures that are shorter than S (or of similar length) that X could be instead.
Technically, you could have a subset of strings that didn’t grow exponentially. For example, you could, for some reason, decide to specify your Markov chains using only strings of zeros. That would grow linearly rather than exponentially. But this is clearly a less natural specification method.
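The counting argument above can be made concrete with a few lines of Python. This is only a sketch under the assumption that specifications are arbitrary binary strings; the function names are hypothetical.

```python
def count_binary_strings_up_to(length):
    """Number of binary strings of length 0..length, i.e. sum of 2^k."""
    return 2 ** (length + 1) - 1

def fraction_at_most(short, long):
    """Fraction of strings of length <= long that have length <= short."""
    return count_binary_strings_up_to(short) / count_binary_strings_up_to(long)

def count_unary_strings_up_to(length):
    """Strings made only of zeros: one per length, so growth is linear."""
    return length + 1

# How rare are specifications that are `gap` bits shorter than a 100-bit budget?
for gap in (1, 5, 10, 20):
    print(gap, fraction_at_most(100 - gap, 100))
```

Each extra bit of slack roughly halves the fraction, so specifications that are k bits shorter than the budget make up about 2^-k of the space. The unary count is included only to mirror the footnote's example of a specification method that grows linearly.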
There is a straightforward compmech take also.
If the goal of the agent is simply to predict well (let’s say the reward is directly tied to good prediction) for a sequential task AND it performs optimally then we know it must contain the Mixed State Presentation of the epsilon machine (causal states).
Importantly the MSP must be used if optimal prediction is achieved.
There is a variant I think, that has not been worked out yet but we talked about briefly with Fernando and Vanessa in Manchester recently for transducers /MDPs
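For readers unfamiliar with mixed states, here is a minimal sketch of the belief update that generates them: Bayesian filtering over the hidden states of an HMM. The Golden Mean HMM and the names `symbol_matrices` and `update_belief` are illustrative assumptions on my part, not taken from any particular compmech library.

```python
def update_belief(belief, symbol_matrix):
    """One Bayesian filtering step: belief' is proportional to belief @ T^(x).

    symbol_matrix[i][j] = P(emit x and move to state j | currently in state i).
    """
    n_next = len(symbol_matrix[0])
    unnormalized = [
        sum(belief[i] * symbol_matrix[i][j] for i in range(len(belief)))
        for j in range(n_next)
    ]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("observed symbol has zero probability under this belief")
    return [u / total for u in unnormalized]

# Golden Mean HMM with hidden states A (index 0) and B (index 1).
symbol_matrices = {
    0: [[0.0, 0.5], [0.0, 0.0]],  # A emits 0 and moves to B with prob 0.5
    1: [[0.5, 0.0], [1.0, 0.0]],  # A emits 1 and stays; B emits 1 and moves to A
}

belief = [2 / 3, 1 / 3]  # stationary distribution over (A, B)
for symbol in (0, 1, 1):
    belief = update_belief(belief, symbol_matrices[symbol])
    print(symbol, belief)
```

The set of beliefs reachable this way, together with the transitions between them, is what the Mixed State Presentation tracks. In this toy case a single observation already pins down the hidden state.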
Not much to add, I haven’t spent enough time thinking about structural selection theorems.
I’m a fan of making more assumptions. I’ve had a number of conversations with people who seem to make the mistake of not assuming enough, sometimes leading them to incorrectly consider various things impossible. E.g. “How could an agent store a utility function over all possible worlds?” or “Rice’s theorem/halting problem/incompleteness/NP-hardness/no-free-lunch theorems means it’s impossible to do xyz”. The answer is always nah, it’s possible, we just need to take advantage of some structure in the problem.
Finding the right assumptions is really hard though, it’s easy to oversimplify the problem and end up with something useless.
I think I get what you mean, though making more assumptions is perhaps not the best way to think about it. Logic is monotonic (classical logic at least), meaning that a valid proof remains valid even when adding more assumptions. The “taking advantage of some structure” seems to be different.
Hey, some thoughts in case helpful. I was exploring a little bit into the ‘agent structure’ sort of questions and the Good/Gooder regulator landscape.
You can take GR a bit further by looking at a temporally indexed MDP-like causal diagram and applying various bookkeeping transformations. Search ‘combine nodes’ in John’s post on Bayes net algebra and ‘uncombine’ in my comment on the same.
Then you can see a ‘good regulator motif’ across many timesteps and timescales and draw some richer conclusions.
The first new qualitative thing in Information Theory when you move from two variables to three variables is the presence of negative values: information measures (entropy, conditional entropy, mutual information) are always nonnegative for two variables, but there can be negative triple mutual information I(X;Y;Z).
This so far is a relatively well-known fact. But what is the first new qualitative thing when moving from three to four variables? Non-Shannon-type Inequalities.
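The negative triple mutual information mentioned above can be checked on the standard XOR example: X and Y are independent fair bits and Z = X xor Y. The sketch below uses the inclusion-exclusion form of I(X;Y;Z), with the sign convention I(X;Y;Z) = I(X;Y) − I(X;Y|Z); the function names are my own.

```python
from itertools import product
from math import log2

# Joint distribution over (x, y, z) with z = x XOR y and x, y fair and independent.
joint = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

def marginal_entropy(joint, indices):
    """Entropy in bits of the marginal over the given coordinate indices."""
    marginal = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in indices)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marginal.values() if p > 0)

def triple_mutual_information(joint):
    """I(X;Y;Z) via inclusion-exclusion over joint entropies."""
    H = lambda *idx: marginal_entropy(joint, idx)
    return (H(0) + H(1) + H(2)
            - H(0, 1) - H(0, 2) - H(1, 2)
            + H(0, 1, 2))

print(triple_mutual_information(joint))
```

Here I(X;Y) = 0 while I(X;Y|Z) = 1 bit, so the triple term comes out to −1 bit.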
A fundamental result in Information Theory is that I(X;Y∣Z)≥0 always holds.
Given n random variables X1,…,Xn and α,β,γ⊆[n], from now on we write I(α;β∣γ) with the obvious interpretation of the variables standing for the joint variables they correspond to as indices.
Since I(α;β|γ)≥0 always holds, a nonnegative linear combination of a bunch of these is always a valid inequality, which we call a Shannon-type Inequality.
Then the question is whether Shannon-type Inequalities capture all valid information inequalities of n variables. It turns out, yes for n=2, (approximately) yes for n=3, and no for n≥4.
Behold, the glorious Zhang-Yeung inequality, a Non-Shannon-type Inequality for n=4:
Given n random variables and α,β,γ⊆[n], it turns out that I(α;β∣γ)≥0 is equivalent to H(α∪β)+H(α∩β)≤H(α)+H(β) (submodularity), H(α)≤H(β) if α⊆β, and H(∅)=0.
This lets us write the inequality involving conditional mutual information in terms of joint entropy instead.
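To spell out the translation, here is a short derivation sketch (my own write-up of the standard argument, with A and B introduced only as shorthand). Conditional mutual information expands into joint entropies, and the three conditions on H recover its nonnegativity:

```latex
\begin{align*}
I(\alpha;\beta \mid \gamma)
  &= H(\alpha\cup\gamma) + H(\beta\cup\gamma) - H(\alpha\cup\beta\cup\gamma) - H(\gamma). \\
\intertext{Write $A = \alpha\cup\gamma$ and $B = \beta\cup\gamma$, so that
$A\cup B = \alpha\cup\beta\cup\gamma$ and $\gamma \subseteq A\cap B$. Then}
H(A) + H(B) &\ge H(A\cup B) + H(A\cap B) && \text{(submodularity)} \\
            &\ge H(A\cup B) + H(\gamma)   && \text{(monotonicity)},
\end{align*}
which rearranges to $I(\alpha;\beta\mid\gamma)\ge 0$. Conversely,
\begin{align*}
H(\alpha) + H(\beta) - H(\alpha\cup\beta) - H(\alpha\cap\beta)
  &= I(\alpha\setminus\beta;\ \beta\setminus\alpha \mid \alpha\cap\beta) \ge 0, \\
H(\beta) - H(\alpha)
  &= I(\beta\setminus\alpha;\ \beta\setminus\alpha \mid \alpha) \ge 0
  \quad \text{for } \alpha\subseteq\beta .
\end{align*}
```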
Let Γ*_n then be a subset of R^(2^n), each element corresponding to the values of the joint entropy assigned to each subset of some random variables X_1,…,X_n. For example, an element of Γ*_2 would be (H(∅),H(X_1),H(X_2),H(X_1,X_2))∈R^(2^2) for some random variables X_1 and X_2, with a different element being a different tuple induced by a different pair of random variables (X′_1,X′_2).
Now let Γ_n represent elements of R^(2^n) satisfying the three aforementioned conditions on joint entropy. For example, Γ_2’s element would be (h_∅,h_1,h_2,h_12)∈R^(2^2) satisfying e.g., h_1≤h_12 (monotonicity). This is also a convex cone, so its elements really do correspond to “nonnegative linear combinations” of Shannon-type inequalities.
Then, the claim that “nonnegative linear combinations of Shannon-type inequalities span all inequalities on the possible Shannon measures” would correspond to the claim that Γ_n=Γ*_n for all n.
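As a sanity check on the definition of the Shannon region, here is a small Python sketch that tests whether a candidate entropy vector satisfies the three conditions (H(∅)=0, monotonicity, submodularity). The dictionary representation keyed by frozensets and the function names are my own assumptions. Note that this only tests membership in the outer region defined by the Shannon-type conditions, not whether the vector is actually realized by some random variables.

```python
from itertools import combinations

def all_subsets(n):
    """All subsets of {1, ..., n} as frozensets."""
    items = range(1, n + 1)
    return [frozenset(c) for r in range(n + 1) for c in combinations(items, r)]

def satisfies_shannon_conditions(h, n, tol=1e-9):
    """Check H(empty)=0, monotonicity, and submodularity for h: subset -> value."""
    subsets = all_subsets(n)
    if abs(h[frozenset()]) > tol:
        return False
    for a in subsets:
        for b in subsets:
            if a <= b and h[a] > h[b] + tol:              # monotonicity
                return False
            if h[a | b] + h[a & b] > h[a] + h[b] + tol:   # submodularity
                return False
    return True

fs = frozenset
# Two independent fair bits: entropies 0, 1, 1, 2.
independent_bits = {fs(): 0.0, fs({1}): 1.0, fs({2}): 1.0, fs({1, 2}): 2.0}
# Joint entropy larger than the sum of marginals violates submodularity.
too_much_joint = {fs(): 0.0, fs({1}): 1.0, fs({2}): 1.0, fs({1, 2}): 3.0}
# Joint entropy smaller than a marginal violates monotonicity.
too_little_joint = {fs(): 0.0, fs({1}): 1.0, fs({2}): 1.0, fs({1, 2}): 0.5}

for name, vec in [("independent", independent_bits),
                  ("too much", too_much_joint),
                  ("too little", too_little_joint)]:
    print(name, satisfies_shannon_conditions(vec, 2))
```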
The content of the papers linked above is to show that:
This implies that, while there exists a 2^3-tuple satisfying Shannon-type inequalities that can’t be constructed or realized by any random variables X_1,X_2,X_3, there does exist a sequence of random variables (X_1^(k),X_2^(k),X_3^(k)), k=1,2,…, whose induced 2^3-tuples of joint entropies converge to that tuple in the limit.
epistemic status: unoriginal. trying to spread a useful framing of theoretical progress introduced from an old post.
Tl;dr, often the greatest theoretical challenge comes from the step of crossing the chasm from [developing an impractical solution to a problem] to [developing some sort of a polytime solution to a problem], because the nature of their solutions can be opposites.
Solving a foundational problem to its implementation often takes the following steps (some may be skipped):
developing a philosophical problem
developing a solution given infinite computing power
developing an impractical solution
developing some sort of polytime solution
developing a practical solution
and he says that it is often during the 3 → 4 step in which understanding gets stuck and the most technical and brute-force math (and i would add sometimes philosophical) work is needed, because:
a common motif in 3) is that they’re able to prove interesting things about their solutions, like asymptotic properties, by e.g., having their algorithms iterate through all turing machines, hence somewhat conferring the properties of the really good turing machine solution that exists somewhere in this massive search space to the overall search algorithm (up to a massive constant, usually).
think of Levin’s Universal Search, AIXItl, Logical Induction.
he says such algorithms are secretly a black box algorithm; there are no real gears.
Meanwhile, algorithms in 4) have the opposite nature—they are polynomial often because they characterize exploitable patterns that make a particular class of problems easier than most others, which requires Real Understanding. So algorithms of 3) and 4) often look nothing alike.
I liked this post and the idea of the “3-4 chasm,” because it explicitly captures why I personally felt that, e.g., AIT, might be less useful for my work: after reading this post, I realized that for example when I refer to the word “structure,” I’m usually pointing at the kind of insights required to cross the 3-4 gap, while others might be using the same word to refer to things at a different level. This causes me to get confused as to how some tool X that someone brought up is supposed to help with the 3-4 gap I’m interested in.[1]
Vanessa Kosoy refers to this post, saying (in my translation of her words) that a lot of the 3-4 gap in computational learning theory has to do with our lack of understanding of deep learning theory, like how the NP-complete barrier is circumvented in practical problems, what restrictions we can put on our hypothesis class to make them efficiently learnable in the same way our world seems efficiently learnable, etc.
She mentions that this gap, at least in the context of deep learning theory, isn’t too much of a pressing problem because it already has mainstream attention—which explains why a lot of her work seems to lie in the 1-3 regime.
I asked GPT for examples of past crossings of the 3-4 chasm in other domains, and it suggested [Shannon’s original technically-constructive-but-highly-infeasible proof for the existence of optimal codes] vs. [recent progress on turbo codes that actually approach this limit while being very practical], which seems like a perfect example.
I agree with this framing.
The issue of characterizing in what way Our World is Special is the core theoretical question of learning theory.
Framing it as a single 3-4 bottleneck maybe understates how large the space of questions is here. E.g. it encompasses virtually every field of theoretical computer science, and the physics & mathematics relevant to computation outside of AIT and numerical math.
I’d vote for removing the stage “developing some sort of polytime solution” and just calling 4 “developing a practical solution”. I think listing that extra step is coming from the perspective of someone who’s more heavily involved in complexity classes. We’re usually interested in polynomial time algorithms because they’re usually practical, but there are lots of contexts where practicality doesn’t require a polynomial time algorithm, or really, where we’re just not working in a context where it’s natural to think in terms of algorithms with run-times.
Tl;dr, agents selected to perform robustly in various local interventional distributions must internally represent something isomorphic to a causal model of the variables upstream of utility, for it is capable of answering all causal queries for those variables.
Thm 1: agents achieving optimal policy (util max) across various local interventions must be able to answer causal queries for all variables upstream of the utility node
Thm 2: relaxation of above to nonoptimal policies, relating regret bounds to the accuracy of the reconstructed causal model
the proof is constructive—an algorithm that, when given access to regret-bounded-policy-oracle wrt an environment with some local intervention, queries them appropriately to construct a causal model
one implication is an algorithm for causal inference that converts black box agents to explicit causal models (because, y’know, agents like you and i are literally that aforementioned ‘regret-bounded-policy-oracle‘)
These selection theorems could be considered the converse of the well-known statement that given access to a causal model, one can find an optimal policy. (this and its relaxation to approximate causal models is stated in Thm 3)
Thm 1 / 2 is like a ‘causal good regulator‘ theorem.
gooder regulator theorem is not structural—as in, it gives conditions under which a model of the regulator must be isomorphic to the posterior of the system—a black box statement about the input-output behavior.
theorem is limited. only applies to cases where the decision node is not upstream of the environment nodes (eg classification. a negative example would be an mdp). but authors claim this is mostly for simpler proofs and they think this can be relaxed.
theorem is limited. only applies to cases where the decision node is not upstream of the environment nodes
I think you can drop this premise and modify the conclusion to “you can find a causal model for all variables upstream of the utility and not downstream of the decision.”
tl;dr, goal directedness of a policy wrt a utility function is measured by its min distance to one of the policies implied by the utility function, as per the intentional stance—that one should model a system as an agent insofar as doing so is useful.
Details
how is “policies implied by the utility function” operationalized? given a value u, we define a set containing policies of maximum entropy (of the decision variable, given its parents in the causal bayes net) among those policies that attain the utility u.
then union them over all the achievable values of u to get this “wide set of maxent policies,” and define goal directedness of a policy π wrt a utility function U as the maximum (negative) cross entropy between π and an element of the above set. (actually we get the same result if we take the max over just the set of maxent policies achieving the same utility as π.)
Intuition
intuitively, this is measuring: “how close is my policy π to being ‘deterministic,’ while ‘optimizing U at the competence level u(π)’ and not doing anything else ‘deliberately’?”
“close” / “deterministic” ~ large negative CE means small CE(π,πmaxent)=H(π)+KL(π||πmaxent)
“not doing anything else ‘deliberately’” ~ because we’re quantifying over maxent policies. the policy is maximally uninformative/uncertain, the policy doesn’t take any ‘deliberate’ i.e. low entropy action, etc.
“at the competence level u(π)” ~ … under the constraint that it is identically competent to π
and you get the nice property of the measure being invariant to translation / scaling of U.
obviously so, because a policy is maxent among all policies achieving u on U iff that same policy is maxent among all policies achieving au+b on aU+b, so these two utilities have the same “wide set of maxent policies.”
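The decomposition CE(π, π_maxent) = H(π) + KL(π || π_maxent) used in the intuition above can be checked numerically. The sketch below uses made-up distributions over three actions at a single decision point (both `policy` and `reference` are hypothetical), and it ignores the conditioning on parents in the causal Bayes net that the actual measure involves.

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """CE(p, q) = -sum_a p(a) log2 q(a); assumes q > 0 wherever p > 0."""
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

policy = [0.7, 0.2, 0.1]        # hypothetical policy over three actions
reference = [1 / 3, 1 / 3, 1 / 3]  # hypothetical maxent reference policy

ce = cross_entropy(policy, reference)
decomposed = entropy(policy) + kl_divergence(policy, reference)
print(ce, decomposed)
```

Against a uniform reference the cross entropy is log2(3) regardless of the policy, since entropy and KL trade off exactly. With a non-uniform maxent reference the two terms no longer cancel in this way.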
Critiques
I find this measure problematic in many places, and am confused whether this is conceptually correct.
one property claimed is that the measure is maximum for uniquely optimal / anti-optimal policy.
it’s interesting that this measure of goal-directedness isn’t exactly an ~increasing function of u(π), and i think it makes sense. i want my measure of goal-directedness to, when evaluated relative to human values, return a large number for both aligned ASI and signflip ASI.
… except, going through the proof one finds that the latter property heavily relies on the “uniqueness” of the policy.
My policy can get the maximum goal-directedness measure if it is the only policy of its competence level while being very deterministic. It isn’t clear that this always holds for the optimal/anti-optimal policies or always relaxes smoothly to epsilon-optimal/anti-optimal policies.
Relatedly, the quantification only happens over policies of the same competence level, which feels problematic.
minimum for uniformly random policy (this would’ve been a good property, but unless I’m mistaken I think the proof for the lower bound is incorrect, because negative cross entropy is not bounded below.)
honestly the maxent motivation isn’t super clear to me.
not causal. the reason you need causal interventions is because you want to rule out accidental agency/goal-directedness, like a rock that happens to be the perfect size to seal a water bottle: does your rock adapt when I intervene to change the size of the hole? discovering agents is excellent in this regard.
… except, going through the proof one finds that the latter property heavily relies on the “uniqueness” of the policy. My policy can get the maximum goal-directedness measure if it is the only policy of its competence level while being very deterministic. It isn’t clear that this always holds for the optimal/anti-optimal policies or always relaxes smoothly to epsilon-optimal/anti-optimal policies.
Yeah, uniqueness definitely doesn’t always hold for the optimal/anti-optimal policy. I think the way MEG works here makes sense: if you’re following the unique optimal policy for some utility function, that’s a lot of evidence for goal-directedness. If you’re following one of many optimal policies, that’s a bit less evidence—there’s a greater chance that it’s an accident. In the most extreme case (for the constant utility function) every policy is optimal—and we definitely don’t want to ascribe maximum goal-directedness to optimal policies there.
With regard to relaxing smoothly to epsilon-optimal/anti-optimal policies, from memory I think we do have the property that MEG is increasing in the utility of the policy for policies with utility greater than that of the uniform policy, and decreasing for policies with utility less than that of the uniform policy. I think you can prove this via the property that the set of maxent policies is (very nearly) just Boltzmann policies with varying temperature. But I would have to sit down and think about it properly. I should probably add that to the paper if it’s the case.
minimum for uniformly random policy (this would’ve been a good property, but unless I’m mistaken I think the proof for the lower bound is incorrect, because negative cross entropy is not bounded below.)
Thanks for this. The proof is indeed nonsense, but I think the proposition is still true. I’ve corrected it to this.
Thanks for writing this up! Having not read the paper, I am wondering if in your opinion there’s a potential connection between this type of work and comp mech type of analysis/point of view? Even if it doesn’t fit in a concrete way right now, maybe there’s room to extend/modify things to combine things in a fruitful way? Any thoughts?
EDIT: I no longer think this setup is viable, for reasons that connect to why I think Critch’s operationalization is incomplete and why boundaries should ultimately be grounded in Pearlian Causality and interventions. Check update.
I believe there’s nothing much in the way of actually implementing an approximation of Critch’s boundaries[1] using deep learning.
Recall, Critch’s boundaries are:
Given a world (markovian stochastic process) Wt, map its values W (vector) bijectively using f into ‘features’ that can be split into four vectors each representing a boundary-possessing system’s Viscera, Active Boundary, Passive Boundary, and Environment.
Then, we characterize boundary-ness (i.e. minimal information flow across features unmediated by a boundary) using two mutual information criteria, one for infiltration and one for exfiltration of information.
And a policy of the boundary-possessing system (under the ‘stance’ of viewing the world implied by f) can be viewed as a stochastic map (that has no infiltration/exfiltration by definition) that best approximates the true Wt dynamics.
The interpretation here (under low exfiltration and infiltration) is that f can be viewed as a policy taken by the system in order to perpetuate its boundary-ness into the future and continue being well-described as a boundary-possessing system.
All of this seems easily implementable using very basic techniques from deep learning!
Bijective feature maps are implemented using two NN maps, one in each direction, with an autoencoder loss.
Mutual information is approximated with standard variational approximations. Optimize f to minimize it.
(the interpretation here being—we’re optimizing our ‘stance’ towards the world in a way that best views the world as a boundary-possessing system)
After you train your ‘stance’ using the above setup, learn the policy using an NN with standard SGD, with fixed f.
A very basic experiment would look something like:
Test the above setup on two cellular automata (e.g., GoL, Lenia, etc) systems, one containing just random ash, and the other some boundary-like structure like noise-resistant glider structures found via optimization (there are a lot of such examples in the Lenia literature).[2]
Then (1) check if the infiltration/exfiltration values are lower for the latter system, and (2) do some interp to see if the V/A/P/E features or the learned policy NN have any interesting structures.
I’m not sure if I’d be working on this any time soon, but posting the idea here just in case people have feedback.
I think research on boundaries (both conceptual work and developing practical algorithms for approximating them & schemes involving them) is quite important for alignment for reasons discussed earlier in my shortform.
Ultimately we want our setup to detect boundaries that aren’t just physically contiguous chunks of matter, like informational boundaries, so we want to make sure our algorithm isn’t just always exploiting basic locality heuristics.
I can’t think of a good toy testbed (ideas appreciated!), but one easy thing to try is to just destroy all locality by mapping the automata lattice (which we were feeding as input) with the output of a complicated fixed bijective map over it, so that our system will have to learn locality if it turns out to be a useful notion in its attempt at viewing the system as a boundary.
I don’t see much hope in capturing a technical definition that doesn’t fall out of some sort of game theory, and even the latter won’t directly work for boundaries as representation of respect for autonomy helpful for alignment (as it needs to apply to radically weaker parties).
Boundaries seem more like a landmark feature of human-like preferences that serves as a test case for whether toy models of preference are reasonable. If a moral theory insists on tiling the universe with something, it fails the test. Imperative to merge all agents fails the test unless the agents end up essentially reconstructed. And with computronium, we’d need to look at the shape of things it’s computing rather than at the computing substrate.
I think it’s plausible that the general concept of boundaries can possibly be characterized somewhat independently of preferences, but at the same time have boundary-preservation be a quality that agents mostly satisfy (discussion here. very unsure about this). I see Critch’s definition as a first iteration of an operationalization for boundaries in the general, somewhat-preference-independent sense.
But I do agree that ultimately all of this should tie back to game theory. I find Discovering Agents most promising in this regard, though there are still a lot of problems, some of which I suspect might be easier to solve if we treat systems-with-high-boundaryness as a sort of primitive for the kind-of-thing that we can associate agency and preferences with in the first place.
There are two different points here, boundaries as a formulation of agency, and boundaries as a major component of human values (which might be somewhat sufficient by itself for some alignment purposes). In the first role, boundaries are an acausal norm that many agents end up adopting, so that it’s natural to consider a notion of agency that implies boundaries (after the agent had an opportunity for sufficient reflection). But this use of boundaries is probably open to arbitrary ruthlessness, it’s not respect for autonomy of someone the powers that be wouldn’t sufficiently care about. Instead, boundaries would be a convenient primitive for describing interactions with other live players, a Schelling concept shared by agents in this sense.
The second role as an aspect of values expresses that the agent does care about autonomy of others outside game theoretic considerations, so it only ties back to game theory by similarity, or through the story of formation of such values that involved game theory. A general definition might be useful here, if pointing AIs at it could instill it into their values. But technical definitions don’t seem to work when you consider what happens if you try to protect humanity’s autonomy using a boundary according to such definitions. It’s like machine translation, the problem could well be well-defined, but impossible to formally specify, other than by gesturing at a learning process.
I no longer think the setup above is viable, for reasons that connect to why I think Critch’s operationalization is incomplete and why boundaries should ultimately be grounded in Pearlian Causality and interventions.
(Note: I am thinking as I’m writing, so this might be a bit rambly.)
The world-trajectory distribution is ambiguous.
Intuition: Why does a robust glider in Lenia intuitively feel like a system possessing boundary? Well, I imagine various situations that happen in the world (like bullets) and this pattern mostly stays stable in face of them.
Now, notice that the measure of infiltration/exfiltration depends on ϕ ∈ Δ(W^ω), a distribution over world histories: Infil(ϕ) := Agg_{t≥0} Mut_{W^ω∼ϕ}((V_{t+1}, A_{t+1}); E_t ∣ (V_t, A_t, P_t))
So, for the above measure to capture my intuition, the approximate Markov condition (operationalized by low infil & exfil) must be evaluated over world histories W^ω that include the Lenia pattern avoiding bullets.
Remember, W is the raw world state, no coarse graining. So ϕ is the distribution over the raw world trajectory. It already captures all the “potentially occurring trajectories under which the system may take boundary-preserving-action.” Since everything is observed, our distribution already encodes all of “Nature’s Intervention.” So in some sense Critch’s definition is already causal (in a very trivial sense), by the virtue of requiring a distribution over the raw world trajectory, despite mentioning no Pearlian Causality.
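The infiltration term is a conditional mutual information, so its dependence on ϕ can be made tangible with a tiny estimator. Below is a minimal sketch that assumes the features have already been discretised and that we are handed samples of triples (x, y, z), with x standing for (V_{t+1}, A_{t+1}), y for E_t and z for (V_t, A_t, P_t) at a single time step. The function name is hypothetical, the aggregation over t is left out, and this is a plain plug-in estimator rather than the variational approximation mentioned earlier.

```python
from collections import Counter
from math import log

def conditional_mutual_information(samples):
    """Plug-in estimate of I(X; Y | Z) in nats from a list of (x, y, z) triples."""
    n = len(samples)
    c_xyz = Counter(samples)
    c_xz = Counter((x, z) for x, _, z in samples)
    c_yz = Counter((y, z) for _, y, z in samples)
    c_z = Counter(z for _, _, z in samples)
    total = 0.0
    for (x, y, z), k in c_xyz.items():
        # p(x,y,z) * log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ], written with raw counts
        total += (k / n) * log(k * c_z[z] / (c_xz[(x, z)] * c_yz[(y, z)]))
    return total

# Next state copies the conditioning variable; environment is independent noise.
no_infiltration = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
# Next state copies the environment; conditioning variable is constant.
full_infiltration = [(0, 0, 0), (1, 1, 0)]

print(conditional_mutual_information(no_infiltration))
print(conditional_mutual_information(full_infiltration))
```

The point of the toy data is that the same system can look boundary-like or not depending purely on which samples, i.e. which ϕ, we feed the estimator.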
Issue: Choice of ϕ
Maybe there is some canonical true ϕ for our physical world that minds can intersubjectively arrive at, so there’s no ambiguity.
But when I imagine trying to implement this scheme on Lenia, there’s immediately an ambiguity as to which distribution (representing my epistemic state on which raw world trajectories that will “actually happen”) we should choose:
Perhaps a very simple distribution: assigning uniform probability over world trajectories where the world contains nothing but the glider moving in a random direction with some initial point offset.
I suspect many stances other than the one factorizing the world into gliders would have low infil/exfil, because the world is so simple. This is the case of “accidental boundary-ness.”
Perhaps something more complicated: various trajectories where e.g., the Lenia pattern encounters bullets, evolves with various other patterns, etc.
This I think rules out “accidental boundary-ness.”
I think the latter works. But now there’s a subjective choice of the distribution, and of the set of possible/realistic “Nature’s Interventions” (all the situations that can ever be encountered by the system under which it has boundary-like behaviors) that we want to implicitly encode into our observational distribution. I don’t think it’s natural for ϕ to assign much probability to a trajectory whose initial conditions are set in a very precise way such that everything decays into noise. But this feels quite subjective.
Hints toward a solution: Causality
I think the discussion above hints at a very crucial insight:
ϕ must arise as a consequence of the stable mechanisms in the world.
Suppose the world of Lenia contains various stable mechanisms like a gun that shoots bullets at random directions, scarce food sources, etc.
We want ϕ to describe distributions that the boundary system will “actually” experience in some sense. I want the “Lenia pattern dodges bullet” world trajectory to be considered, because there is a plausible mechanism in the world that can cause such trajectories to exist. For similar reasons, I think the empty world distributions are impoverished, and a distribution containing trajectories where the entire world decays into noise is bad because no mechanism can implement it.
Thus, unless you have a canonical choice of ϕ, a better starting point would be to consider the abstract causal model that encodes the stable mechanisms in the world, and to use Discovering Agents-style interventional algorithms that operationalize the notion “boundaries causally separate environment and viscera.”
Well, because of everything mentioned above on how the causal model informs us on which trajectories are realistic, especially in the absence of a canonical ϕ. It’s also far more efficient, because the knowledge of the mechanism informs the algorithm of the precise interventions to query the world for, instead of having to implicitly bake them in ϕ.
There are still a lot more questions, but I think this is a pretty clarifying answer as to how Critch’s boundaries are limiting and why DA-style causal methods will be important.
I think the update makes sense in general. Isn’t there, however, some way in which mutual information and causality are linked? Maybe it isn’t strong enough for there to be an easy extrapolation from one to the other.
Also I just wanted to drop this to see if you find it interesting, kind of on this topic? I’m not sure it’s fully defined in a causality-based way but it is about structure preservation.
Yeah I’d like to know if there’s a unified way of thinking about information theoretic quantities and causal quantities, though a quick literature search doesn’t show up anything interesting. My guess is that we’d want separate boundary metrics for informational separation and causal separation.
Notes and reflections on the things I’ve learned while Doing Scholarship this week (i.e. studying math)[1].
I am starting to see the value of categorical thinking.
For example from [FOAG], it was quite mindblowing to learn that stalk (the set of germs at a point) can be equivalently defined as a simple colimit of sections of presheaf over open sets of X containing a point, and this definition made proving certain constructions (eg inducing a map of stalks from a map ϕ:X→Y) very easy.
Also, I was first introduced to the concept of a presheaf as an abstraction of a map that takes open sets and returns functions over them, abstracting properties like the existence of a restriction map that composes naturally. Turns out (punchline, presumably) this is just a functor Open(X)^op → Set!
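For reference, the two definitions above can be written compactly as follows. This is the standard formulation, with 𝓕 a presheaf of sets on X and p a point of X; the symbols are chosen here for illustration and may not match [FOAG]’s notation exactly.

```latex
\mathcal{F} \colon \mathrm{Open}(X)^{\mathrm{op}} \to \mathbf{Set},
\qquad
\mathcal{F}_p \;=\; \varinjlim_{U \ni p} \mathcal{F}(U)
\;=\; \bigl\{ (U, s) : p \in U,\ s \in \mathcal{F}(U) \bigr\} \big/ \sim,
\\[4pt]
(U, s) \sim (V, t)
\iff
\exists\, W \subseteq U \cap V \text{ open},\ p \in W,\ s|_W = t|_W .
```

The equivalence classes on the right are exactly the germs at p, which is why the colimit and the “set of germs” descriptions agree.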
Yoneda lemma is very cool. I recall seeing some of the ideas from Programs as Singularities (paper), where there are ideas of embedding programs (for the analogue in Yoneda lemma, the Yoneda embedding being a fully faithful functor from some category C … ) into a different space (… to the category Set^{C^op} …) that contains “almost programs” ( … because the Yoneda embedding is not essentially surjective, let alone surjective on objects), and that studying this enlarged space lends insight into the original space.
Rabbit hole: Yoneda lemma as expressing consistency conditions on Lawvere’s idea of Space and Quantity being presheaf and copresheaf??
I am also starting to more appreciate the notion of sheaf or ringed spaces from [FOAG] - or more generally, the notion that a “space” can be productively studied by studying functions defined on it. For example I learned from [Bredon] that a manifold, whose usual definition is a topological space locally homeomorphic to a Euclidean space, can equivalently be defined as a ringed space whose structure sheaf is valued in some subalgebra of continuous maps over a given open set. Very cool!
Started reading [Procesi] to learn invariant theory and representation theory because it came up quite often as my bottleneck in my recent work (eg). Also interpretability, apparently. So far I just read pg 1-9, reviewing the very basics of group action (e.g., orbit stabilizer theorem). Lie groups aren’t coming up until pg ~50 so until then I should catch up on the relevant Lie group prerequisites through [Lee] or [Bredon].
Also reviewed some basic topology by skimming pg 1-50 of [Bredon]. So many rabbit holes just in point-set topology that I can’t afford (time-wise) to follow, e.g.,
(1) nets generalize sequences and successfully characterize topological properties—I once learned of filters, and I do not yet know how they relate (and don’t relate constructively + why constructively filters are more natural) and especially universal net vs ultrafilter
(2) I didn’t know that manifolds are metrizable, but yes they are by an easy consequence of the Urysohn metrization theorem (second-countable & completely regular ⇒ metrizable). But I would like to study this proof in more detail. Also, how to intuitively think about the metric of a manifold?
I didn’t know that the proof of Urysohn metrization was this nice! It’s a consequence of the following lemma: recall, “completely regular” means that given a point x and a closed set C with x∉C, there exists a continuous f: X→[0,1] s.t. f(x)=0 and f(C)=1. BUT adding second-countability to the hypothesis then lets you choose this f from a fixed, countable family F.
Then, mapping X under this countable family of functions (thus taking values in [0,1]^ℕ) turns out to be an embedding, and [0,1]^ℕ can be metrized, so X can be metrized as well.
(3) I learned about various natural variants / generalizations of compactness (σ-compactness, local compactness, paracompactness). My understanding of their importance is because:
(a) paracompactness implies the existence of partition of unity subordinate to any open cover (a consequence of paracompactness ⇒ normal, and Urysohn’s lemma, also paracompactness by definition allowing you to find the open refinement of the given open cover as required by the definition of partition of unity subordinate to an open cover.)
(b) for locally compact Hausdorff X, we can characterize paracompactness by “disjoint union of open σ-compact subsets,” which is much easier to check than the definition of paracompactness as locally finite open refinement of open covers.
e.g., from this, it is immediate that manifolds are paracompact: (1) locally Euclidean ⇒ locally compact. (2) Second-countable ⇒ Lindelöf. (3) Lindelöf & locally compact ⇒ σ-compact. (1) & (2) & (3) + above ⇒ manifolds are paracompact. From which other properties of manifolds immediately follow from that of paracompactness, eg manifolds always admit a partition of unity subordinate to any open cover.
But rabbit hole: recall, open sets axiomatize semidecidable properties. What is, then, the logical interpretation of compactness, σ-compactness, local compactness, paracompactness?
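The embedding argument in (2) above can be written out as a short sketch. Here {f_n} is an enumeration of the countable family F from the lemma, and the weights 2^{-n} are one standard way to metrize [0,1]^ℕ, not necessarily the choice made in [Bredon].

```latex
\Phi \colon X \to [0,1]^{\mathbb{N}},
\qquad
\Phi(x) = \bigl( f_n(x) \bigr)_{n \in \mathbb{N}},
\qquad
d(x, y) \;=\; \sum_{n \in \mathbb{N}} 2^{-n} \, \bigl| f_n(x) - f_n(y) \bigr| .
```

Assuming points are closed, Φ is injective: for x ≠ y, apply the lemma to x and C = {y} to get some f_n with f_n(x) = 0 and f_n(y) = 1. The same separation property applied to a general closed C is what makes Φ a homeomorphism onto its image, so d induces the original topology on X.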
This week, I’ll start tracking the exercises I solve and pages I cover and post them in next week’s shortform (EDIT: biweekly), so that I can keep track of my progress + additional accountability.
[Procesi]: Procesi, Lie Groups: An Approach through Invariants and Representations
and I plan to do most of the exercises for each of the textbooks unless I find some of them too redundant. For this week’s shortform I haven’t written down my progress this week on each of these books nor the problems I’ve solved because I haven’t started tracking them, so I’ll do them starting next week.
Started reading [Procesi] to learn invariant theory and representation theory because it came up quite often as my bottleneck in my recent work (eg). Also interpretability, apparently. So far I just read pg 1-9, reviewing the very basics of group action (e.g., orbit stabilizer theorem). Lie groups aren’t coming up until pg ~50 so until then I should catch up on the relevant Lie group prerequisites through [Lee] or [Bredon].
Woit’s “Quantum Theory, Groups and Representations” is fantastic for this IMO. It gives physical motivation for representation theory, connects it to invariants and, of course, works through the physically important lie-groups. The intuitions you build here should generalize. Plus, it’s well written.
Also, if you are ever in the market for differential topology, algebraic topology, and algebraic geometry, then I’d recommend Ronald Brown’s “Topology and Groupoids.” It presents the basic material of topology in a way that generalizes better to the fields above, along with some powerful geometric tools for calculations.
Thanks for the recommendation! Woit’s book does look fantastic (also as an introduction to quantum mechanics). I also know Sternberg’s Group Theory and Physics to be a good representation theory & physics book.
I did encounter Brown’s book during my search for algebraic topology books but I had to pass it over in favor of Bredon’s because it didn’t develop the homology / cohomology to the extent I was interested in. Though the groupoid perspective does seem very interesting and useful, so I might read it after completing my current set of textbooks.
No worries! For more recommendations like those two, I’d suggest having a look at “The Fast Track” on Sheafification. Of the books I’ve read from that list, all were fantastic. Note that site emphasises mathematics relevant for physics, and vice versa, so it might not be everyone’s cup of tea. But given your interests, I think you’ll find it useful.
why bayesnets and markovnets? factorized cognition, how to do efficient bayesian updates in practice, it’s how our brain is probably organized, etc. why would anyone want to study this subject if they’re doing alignment research? explain philosophy behind them.
simple examples of bayes nets. basic factorization theorems (the I-map stuff and separation criterion)
tangent on why bayes nets aren’t causal nets, though Zack M Davis had a good post on this exact topic, comment threads there are high insight
how inference is basically marginalization (basic theorems of: a reduced markov net represents conditioning, thus inference upon conditioning is the same as marginalization on a reduced net)
why is marginalization hard? i.e. worst-case NP-completeness of exact and approximate inference. what is a workaround? solve by hand simple cases in which inference can be greatly simplified by just shuffling the order of sums and products, and realize that the exponential blowup of complexity depends on a graphical property of your bayesnet called the treewidth
exact inference algorithms (bounded by treewidth) that can exploit the graph structure and do inference efficiently: sum-product / belief-propagation
approximate inference algorithms (work even in high treewidth! no guarantee of convergence) - loopy belief propagation, variational methods, etc
connections to neuroscience: “the human brain is just doing belief propagation over a bayes net whose variables are the cortical column” or smth, i just know that there is some connection
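As a small illustration of the “shuffling the order of sums and products” item in the outline above, here is a sketch on a chain-structured Bayes net A → B → C → D. The conditional probability tables are random made-up numbers and the variable names are just for this example; the only point is that pushing each sum inside the product gives the same marginal without ever building the full joint.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4  # number of states per variable (arbitrary)

def random_cpt(rows, cols):
    """Random conditional probability table; each row sums to 1."""
    table = rng.random((rows, cols))
    return table / table.sum(axis=1, keepdims=True)

p_a = random_cpt(1, k)[0]        # P(A)
p_b_given_a = random_cpt(k, k)   # P(B | A), indexed [a, b]
p_c_given_b = random_cpt(k, k)   # P(C | B), indexed [b, c]
p_d_given_c = random_cpt(k, k)   # P(D | C), indexed [c, d]

# Naive marginalization: materialise the full joint (k**4 entries), then sum out A, B, C.
joint = np.einsum("a,ab,bc,cd->abcd", p_a, p_b_given_a, p_c_given_b, p_d_given_c)
p_d_naive = joint.sum(axis=(0, 1, 2))

# Variable elimination: sum out A first, then B, then C, one small matrix product at a time.
p_b = p_a @ p_b_given_a
p_c = p_b @ p_c_given_b
p_d = p_c @ p_d_given_c

print(np.allclose(p_d_naive, p_d))
```

On a chain (treewidth 1) each elimination step only touches a k×k table, whereas the naive joint grows exponentially in the number of variables.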
Tl;dr, Systems are abstractable to the extent they admit an abstracting causal model map with low approximation error. This should yield a pareto frontier of high-level causal models consisting of different tradeoffs between complexity and approximation error. Then try to prove a selection theorem for abstractability / modularity by relating the form of this curve and a proposed selection criteria.
Recall, an abstracting causal model (ACM)—exact transformations, τ-abstractions, and approximations—is a map between two structural causal models satisfying certain requirements that lets us reasonably say one is an abstraction, or a high-level causal model of another.
Broadly speaking, the condition is a sort of causal consistency requirement. It’s a commuting diagram that requires the “high-level” interventions to be consistent with various “low-level” ways of implementing that intervention. Approximation errors talk about how well the diagram commutes (given that the support of the variables in the high-level causal model is equipped with some metric)
Now consider a curve: x-axis is the node count, and y-axis is the minimum approximation error of ACMs of the original system with that node count (subject to some conditions[1]). It would hopefully be a decreasing one[2].
This curve would represent the abstractability of a system. The lower the curve, the more abstractable it is.
Aside: we may conjecture that natural systems will have discrete jumps, corresponding to natural modules. The intuition being that, eg if we have a physics model of two groups of humans interacting, in some sense 2 nodes (each node representing the human-group) and 4 nodes (each node representing the individual-human) are the most natural, and 3 nodes aren’t (perhaps the 2 node system with a degenerate node doing ~nothing, so it would have very similar approximation scores with the 2 node case).
Then, try hard to prove a selection theorem of the following form: given low-level causal model satisfying certain criteria (eg low regret over varying objectives, connection costs), the abstractability curve gets pushed further downwards. Or conversely, find conditions that make this true.
I don’t know how to prove this[3], but at least this gets closer to a well-defined mathematical problem.
I’ve been thinking about this for an hour now and finding the right definition here seems a bit non-trivial. Obviously there’s going to be an ACM of zero approximation error for any node count, just have a single node that is the joint of all the low-level nodes. Then the support would be massive, so a constraint on it may be appropriate.
Or instead we could fold it in to the x-axis: if there is perhaps a non ad-hoc, natural complexity measure for Bayes Nets that captures [high node counts ⇒ high complexity because each node represents a stable causal mechanism of the system, aka a module] and [high support size ⇒ high complexity because we don’t want modules that are “contrived” in some sense] as special cases, then we could use this as the x-axis instead of just node count.
Immediate answer: Restrict this whole setup into a prediction setting so that we can do model selection. Require on top of causal consistency that the low-level and high-level causal models each have a designated node, and that the predictive distributions of these two nodes are similar. Now we can talk about eg the RLCT of a Bayes Net. I don’t know if this makes sense. Need to think more.
I suspect closely studying the robust agents learn causal world models paper would be fruitful, since they also prove a selection theorem over causal models. Their strategy is to (1) develop an algorithm that queries an agent with low regret to construct a causal model, (2) prove that this yields an approximately correct causal model of the data generating model, (3) then argue that this implies the agent must internally represent something isomorphic to a causal world model.
A simple sketch of the role data structure plays in loss landscape degeneracy.
The RLCT[1] is a function of both q(x) and p(x|θ). The role of p(x|θ) is clear enough, with very intuitive examples[2] of local degeneracy arising from the structure of the parameter function map. However until recently the intuitive role of q(x) really eluded me.
I think I now have some intuitive picture of how structure in q(x) influences RLCT (at least particular instances of it). Consider the following example.
Toy Example: G-invariant distribution, G-equivariant submodule
Suppose the true distribution is (1) realizable (p(⋅|θ∗)=q(⋅) for some θ∗), (2) invariant under some group action, q(x)=q(gx) ∀x, ∀g∈G. Now, suppose that the model class is that of exponential models, i.e. p(x|θ)∝exp(⟨θ,T(x)⟩). In particular, suppose that T, the fixed feature map, is G-equivariant, i.e. ∃ρ:G→GL(R^d) such that T(gx)=ρ(g)T(x).
Claim: There is a degeneracy of the form p(x|θ∗)=p(x|ρ(g)∗(θ∗)), and in particular if G is a Lie group, the rank upper bound of RLCT decreases by (1/4)·dim G.
This is nothing nontrivial. The first claim is an immediate consequence of the definitions:
p(⋅|θ∗)=q(⋅) and q(x)=q(gx) implies p(x|θ∗)=p(gx|θ∗)∀x
Then, we have the following (with ρ(g)∗ the adjoint of ρ(g), and normalizing constants suppressed): p(gx∣θ∗) = exp(⟨θ∗,T(gx)⟩) = exp(⟨θ∗,ρ(g)T(x)⟩) = exp(⟨ρ(g)∗(θ∗),T(x)⟩) = p(x∣ρ(g)∗(θ∗)).
… and the latter claim on RLCT is a consequence of p(x|θ∗)=p(x|ρ(g)∗(θ∗)) reducing the rank of the Hessian of L(θ) at θ∗ by dim G, together with the rank upper bound result here.
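The identity p(gx|θ)=p(x|ρ(g)∗(θ)) is easy to check numerically in a finite toy case. The sketch below swaps the Lie group for the cyclic group Z_n acting on x ∈ {0,…,n−1} by shifts, takes T(x) to be the one-hot vector e_x, and ρ(g) to be the corresponding permutation matrix (so the adjoint is just the transpose). All of these concrete choices, and the function names, are assumptions made for illustration; they are not part of the argument above.

```python
import numpy as np

n = 6
rng = np.random.default_rng(0)

def T(x):
    """Feature map: one-hot vector e_x in R^n."""
    e = np.zeros(n)
    e[x] = 1.0
    return e

def rho(g):
    """Permutation matrix with rho(g) @ e_x = e_{(x + g) mod n}."""
    return np.roll(np.eye(n), g, axis=0)

def p(theta):
    """Exponential model over x in {0, ..., n-1}: p(x | theta) ∝ exp(<theta, T(x)>)."""
    logits = np.array([theta @ T(x) for x in range(n)])
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()

theta = rng.normal(size=n)
for g in range(n):
    # Equivariance of the feature map: T(gx) = rho(g) T(x).
    assert all(np.allclose(T((x + g) % n), rho(g) @ T(x)) for x in range(n))
    lhs = np.array([p(theta)[(x + g) % n] for x in range(n)])  # p(gx | theta)
    rhs = p(rho(g).T @ theta)                                  # p(x | rho(g)^* theta)
    assert np.allclose(lhs, rhs)
```

Note that the check passes for an arbitrary θ, not just at θ∗: with a fixed equivariant T the symmetry is generic in the parameters.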
High-level idea: Emulability of input symmetry
While this model is very toy, I think the high-level idea that it is a concrete model of is interesting. Abstracting out, the proof of how data structure influences degeneracy routes through two steps:
The true distribution has some structure / symmetry, say, q(x)=q(x+δx)∀x (with δx as a function of x, indicating some infinitesimal change; all of this is meant to be taken heuristically), which gets imparted onto p(⋅|θ∗) by realizability, i.e. p(x|θ∗)=p(x+δx|θ∗)∀x.
Emulatability: At θ∗, the model can “emulate” certain classes of perturbations to certain classes of input x by instead perturbing the parameters, i.e. p(x+δx|θ∗)=p(x|θ∗+δθ).[3]
Basically, (1) realizability imparts input-symmetry to p(⋅|θ∗), and (2) emulatability essentially “pushes forward” this to a symmetry in the parameters[4]. I think this is very interesting!
Story: Suppose I am tasked with image segmentation, but my visual cortex is perturbed by δθ, causing me to perceive colors with a slightly different hue. Then, if my visual cortex wasn’t perturbed but rather the world’s color shifted to that hue i.e. δx, then I would virtually not notice anything and be making the same predictions p(x+δx|θ∗)=p(x|θ∗+δθ).
Going back to the exponential model, the most unrealistic part of it (even after taking into account that it is a toy instantiation of this high-level idea) is the fact that its symmetry is generic: p(gx|θ)=p(x|ρ(g)∗(θ)) holds for ALL θ, since the G-equivariant T is independent of θ. A more realistic model would look something like p(x|θ)∝exp(⟨θ1,T_{θ2}(x)⟩) with θ=(θ1,θ2), where T also depends on θ2 and importantly, whether T satisfies G-equivariance depends on the value of θ2.
Then, if p_{θ∗}=p_{θ′∗}=q but θ∗ makes T G-equivariant while θ′∗ doesn’t, then the rank upper bound of the RLCT for the former is lower than that of the latter (thus θ∗ would be represented much more strongly in the Bayesian posterior).
This is more realistic, and I think sheds some light on why training imparts models with circuits / algorithms / internal symmetries that reflect structure in the data.
(Thanks to Dan Murfet for various related discussions.)
Very brief SLT context: In SLT, the main quantity of interest is RLCT, which broadly speaking is a measure of degeneracy of the most degenerate point among the optimal parameters. We care about this because it directly controls the asymptotics of the Bayesian posterior. Also, we often care about its localized version where we restrict the parameter space W to an infinitesimal neighborhood (germ) of a particular optimal parameter we’re interested in measuring the degeneracy of.
RLCT is a particular invariant of the average log likelihood function L(θ)=∫q(x)logp(x|θ)dx, meaning it is a function of the true distribution q(x) and the parametric model p(x|θ) (the choice of the prior φ(θ) doesn’t matter under reasonable regularity conditions).
Given a two-layer feedforward network with ReLU, multiplying the first layer by α>0 and dividing the next by α implements the same function. There are many other examples, including non-generic degeneracies that occur only at particular weight values, unlike the constant multiplication degeneracy which occurs at every θ; more examples are in Liam Carroll’s thesis.
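A minimal sketch of this rescaling degeneracy, assuming a bias-free two-layer network written in plain Python (the weights and helper names are arbitrary choices for illustration). It relies on ReLU being positively homogeneous, relu(αz)=α·relu(z) for α>0, which is why the sign of α matters.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(m, v):
    return [sum(w * a for w, a in zip(row, v)) for row in m]

def two_layer(w1, w2, x):
    """f(x) = W2 relu(W1 x), with no biases (an assumption of this sketch)."""
    return matvec(w2, relu(matvec(w1, x)))

def rescale(w1, w2, alpha):
    """Multiply the first layer by alpha and divide the second by alpha."""
    w1s = [[alpha * w for w in row] for row in w1]
    w2s = [[w / alpha for w in row] for row in w2]
    return w1s, w2s

w1 = [[1.0, -2.0], [0.5, 3.0], [-1.5, 0.25]]
w2 = [[2.0, -1.0, 0.5]]
x = [0.7, -0.4]

original = two_layer(w1, w2, x)
rescaled = two_layer(*rescale(w1, w2, 3.0), x)    # alpha > 0: same function
flipped = two_layer(*rescale(w1, w2, -1.0), x)    # alpha < 0: generally a different function
print(original, rescaled, flipped)
```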
Let the input-side symmetry be trivial (i.e. δx=0), and we recover degeneracies originating from the structure of the parameter-function map alone as a special case.
Any thoughts on how to customize LessWrong to make it LessAddictive? I just really, really like the editor for various reasons, so I usually write a bunch (drafts, research notes, study notes, etc) using it but it’s quite easy to get distracted.
You could use the ad & content blocker uBlock Origin to zap any addictive elements of the site, like the main page feed or the Quick Takes or Popular Comments. Then if you do want to access these, you can temporarily turn off uBlock Origin.
Incidentally, uBlock Origin can also be installed on mobile Firefox, and you can manually sync its settings across devices.
moments of microscopic fun encountered while studying/researching:
Quantum mechanics calls the vectors of a vector space & its dual ket/bra because … bra-c-ket. What can I say? I like it. But where did the letter ‘c’ go, Dirac?
Defining Cauchy sequences and limits in real analysis: it’s really cool how you “bootstrap” the definition of Cauchy sequences / limits on the reals using the definition of Cauchy sequences / limits on the rationals. Basically:
(1) define Cauchy sequence on rationals
(2) use it to define limit (on rationals) using rational-Cauchy
(3) use it to define reals
(4) use it to define Cauchy sequence on reals
(5) show it’s consistent with Cauchy sequence on rationals in both directions
a. rationals are embedded in reals hence the real-Cauchy definition subsumes rational-Cauchy definition
b. you can always find a positive rational number smaller than a given positive real number, hence a rational sequence being rational-Cauchy means it is also real-Cauchy
(6) define limit (on reals)
(7) show it’s consistent with limit on rationals
(8) … and that they’re equivalent to real-Cauchy
(9) proceed to ignore the distinction b/w real-Cauchy/limit and their rational counterpart. Slick!
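To make step (1) concrete, here is a small sketch of a sequence of rationals that is Cauchy but has no rational limit, which is the kind of object the reals are built from in step (3). The choice of Newton's iteration for √2 is mine, not part of the note above, and a finite computation can only illustrate the Cauchy property rather than prove it.

```python
from fractions import Fraction

def newton_sqrt2(steps):
    """Rational Newton iterates x -> (x + 2/x) / 2 starting from 1; every term is a Fraction."""
    x = Fraction(1)
    seq = [x]
    for _ in range(steps):
        x = (x + 2 / x) / 2
        seq.append(x)
    return seq

def tail_spread(seq, start):
    """Largest distance between any two computed terms with index >= start."""
    tail = seq[start:]
    return max(tail) - min(tail)

seq = newton_sqrt2(6)
spreads = [tail_spread(seq, n) for n in range(len(seq))]   # shrinks as n grows
residuals = [abs(x * x - 2) for x in seq]                  # never exactly 0 in the rationals

print([str(x) for x in seq[:4]])
print([float(s) for s in spreads])
```

The spreads are measured against rational tolerances only, so nothing here presupposes the reals.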
Maybe he dropped the “c” because it changes the “a” phoneme from æ to ɑː and gives a cleaner division in sounds: “brac-ket” pronounced together collides with “bracket” where “braa-ket” does not.
Any advice on reducing neck and shoulder pain while studying? For me that’s my biggest blocker to being able to focus longer (especially for math, where I have to look down at my notes/book for a long period of time). I’m considering stuff like getting a standing desk or doing regular back/shoulder exercises. Would like to hear what everyone else’s setups are.
Train skill of noticing tension and focus on it. Tends to dissolve. No that’s not so satisfying but it works. Standing desk can help but it’s just not that comfortable for most.
I still have lots of neck and shoulder tension, but the only thing I’ve found that can reliably lessen it is doing some hard work on a punching bag for about 20 minutes every day, especially hard straights and jabs with full extension.
(Quality: Low, only read when you have nothing better to do—also not much citing)
30-minute high-LLM-temp stream-of-consciousness on “How do we make mechanistic interpretability work for non-transformers, or just any architectures?”
We want a general way to reverse engineer circuits
e.g., Should be able to rediscover properties we discovered from transformers
Concrete Example: we spent a bunch of effort reverse engineering transformer-type architectures, then boom, suddenly some parallel-GPU-friendly-LSTM architecture turns out to have better scaling properties, and everyone starts using it. LSTMs have different inductive biases, like things in the same layer being able to communicate multiple times with each other (unlike transformers), which incentivizes e.g., reusing components (more search-y?).
Formalize:
You have task X. You train a model A with inductive bias I_A. You also train a model B with inductive bias I_B. Your mechanistic interpretability techniques work well on deciphering A, but not B. You want your mechanistic interpretability techniques to work well for B, too.
Proposal: Communication channel
Train a Transformer on task X
Existing Mechanistic interpretability work does well on interpreting this architecture
Somehow stitch the LSTM to the transformer (?)
I’m trying to get at the idea of “interface conversion”: by virtue of SGD being greedy, it will try to convert the LSTM’s outputs into transformer-friendly types
Now you can better understand the intermediate outputs of the LSTM by just running mechanistic interpretability on the transformer layers whose input are from the LSTM
(I don’t know if I’m making any sense here, my LLM temp is > 1)
Proposal: approximation via large models?
Train a larger transformer architecture to approximate the smaller LSTM model (either just input output pairs, or intermediate features, or intermediate features across multiple time-steps, etc):
the basic idea is that a smaller model would be more subject to following its natural gradient shaped by the inductive bias, while larger model (with direct access to the intermediate outputs of the smaller model) would be able to approximate it despite not having as much inductive bias incentive towards it.
probably false but illustrative example: Train small LSTM on chess. By the virtue of being able to run serial computation on same layers, it focuses on algorithms that have repeating modular parts. In contrast, a small Transformer would learn algorithms that don’t have such repeating modular parts. But instead, train a large transformer to “approximate” the small LSTM—it should be able to do so by, e.g., inefficiently having identical modules across multiple layers. Now use mechanistic interpretability on that.
Proposal: redirect GPS?
Thane’s value formation picture says GPS should be incentivized to reverse-engineer the heuristics because it has access to inter-heuristic communication channel. Maybe, in the middle of training, gradually swap different parts of the model with those that have different inductive biases, see GPS gradually learn to reverse-engineer those, and mechanistically-interpret how GPS exactly does that, and reimplement in human code?
Proposal: Interpretability techniques based on behavioral constraints
e.g., Discovering Latent Knowledge without Supervision, putting constraints?
How do we “back out” inductive biases, just given e.g., architecture, training setup? What is the type signature?
Notes and reflections on the things I’ve learned while Doing Scholarship the last two weeks (i.e. studying math).
Mostly the past two weeks were on differential geometry (Lee):
Ch 4 (Submersion, Immersion, Embedding) comments:
Conceptually, by the Constant rank theorem, constant rank maps (smooth maps whose differential dFp:TpM→TF(p)N is constant rank at all p) are precisely the maps with a linear local coordinate representation (thus are maps well-modeled locally by its differentials).
Basically a nonlinear version of the linear algebra theorem that any matrix of rank r can be brought to the block form [[I_r, 0], [0, 0]] by a change of bases on the domain and codomain. The proof is much more complicated however: basically a clever choice of coordinate transformation via the inverse function theorem.
The point of the chapter is to come up with various characterizations of submersion, immersion, embedding. For example, 1) smooth immersion iff locally smooth embedding, 2) smooth submersion iff every point is in the image of a local section, 3) surjective constant-rank maps ⇒ submersion & injective constant-rank maps ⇒ immersion …
The proof of 3) is a very cool application of the Baire category theorem. Baire category theorem says the countable union of nowhere dense sets has empty interior; this is not very motivating, but reading Bredon[1] helped clarify its conceptual significance.
Namely, consider the more illuminating contrapositive statement: countable intersection of dense open sets is dense. Conceptually, the space is some configuration space, and dense open sets represent configurations that satisfy certain generically satisfied constraints (polynomial p(x) being nonzero is a prototypical example, which is a dense & open set). Then, the question is whether the property of a countable number of these constraints being satisfied at the same time is still generic, i.e. dense. The Baire category theorem says this is indeed the case (for locally compact Hausdorff spaces).
Sections are just right inverses, and their intuitive geometric content was a bit confusing until I read the wikipedia page: a section of f is an abstraction of a graph by viewing f as a sort of “projection map.” That makes sense! I’m sure this will come up later in the fiber bundle context.
The “figure-eight curve” and “dense torus map” are prototypical examples of smooth immersions that aren’t smooth embeddings, due to topological considerations.
Ch 5 (Submanifold) comments:
Similar to Ch 4, many useful characterizations of submanifolds and how to generate them. eg embedded submanifold iff locally a “slice” of the ambient manifold’s coordinate chart. embedded submanifold iff image of smooth embedding, immersed submanifold iff image of smooth immersion. Level sets of a smooth map at a “regular value” are embedded submanifolds …
Ch 6 (Sard’s theorem) comments:
Finally, one of the more fun chapters! Finally learned the proof of the Whitney embedding / immersion theorem that I’ve heard a lot about.
The compact case of the Whitney embedding theorem is much more conceptually straightforward:
Given m charts (finitely many, possible since M is compact) covering the n-dim manifold, literally just adjoin them while multiplying them with an appropriate partition of unity to get a M→R^{nm} map, and adjoin the m partition-of-unity functions (a “chart indicator variable”) to get a M→R^{nm+m} map. This turns out to be an injective immersion, and thus an embedding since M is compact.
Apply the projection map R^N→R^{N−1} with a 1-dim kernel Rv. By Sard’s theorem, this turns out to be an immersion (when restricted to M) for almost any choice of v, as long as N>2n+1. Repeatedly apply this to the massive codomain M→R^{nm+m} to get an immersion to R^{2n+1}.
This projection map can in fact be promoted to an embedding, given that the original immersion of M to R^N is an embedding.
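Written out, the map described in the compact case looks roughly like the following. The symbols (φ_i, U_i, ρ_i, F, π_v) are notation I am assuming for this sketch and may not match Lee's exactly.

```latex
% Assumed notation: charts (U_i, \varphi_i), i = 1, \dots, m, with \varphi_i : U_i \to \mathbb{R}^n,
% and a partition of unity \{\rho_i\} subordinate to the cover \{U_i\};
% each product \rho_i \varphi_i is extended by zero outside \operatorname{supp} \rho_i.
F : M \to \mathbb{R}^{nm+m}, \qquad
F(p) = \bigl(\rho_1(p)\varphi_1(p), \dots, \rho_m(p)\varphi_m(p),\; \rho_1(p), \dots, \rho_m(p)\bigr)

% Projection step, for a nonzero v \in \mathbb{R}^N:
\pi_v : \mathbb{R}^N \to \mathbb{R}^N / \mathbb{R}v \cong \mathbb{R}^{N-1}, \qquad
\pi_v \circ F : M \to \mathbb{R}^{N-1}
```

The last m coordinates are what make F injective: if F(p)=F(q), then ρ_i(p)=ρ_i(q)>0 for some i, so both points lie in U_i and φ_i(p)=φ_i(q).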
High-level takeaways:
The most dumb and obvious way of interpolating coordinate charts into a global map via partitions of unity, with slight modifications, gives a bona fide immersion of a manifold into R^N!!
It was interesting to learn that there was a 1-2 decade period of foundational uncertainty (between the first proposal of the abstract manifold definition and Whitney’s above proof) where people didn’t know whether the abstract manifold definition was actually more general than submanifolds of R^N or not.[2]
Partitions of unity really are used everywhere. I wonder how the theory of complex analytic manifolds ever does anything when analytic partitions of unity don’t exist.
Proof strategy of promoting a smooth map to a proper map (at the cost of increased dimensionality of the codomain) by literally adjoining a proper map next to it. Clever!
I presume this is the main motivation behind exhaustion functions (f:M→R s.t. f^{−1}((−∞,c]) is compact ∀c∈R). It’s a proper map, it exists for any manifold (again, shown by partitions of unity), and has codomain of dimension 1 so it minimally increases the codomain dimension.
More applications on Whitney approximation theorems and transversality arguments.
The latter, including the transversality homotopy theorem (actually learned this a year ago in my difftop class, though that class used Guillemin’s book where manifolds are always embedded in R^N - so it’s good to learn them from a more intrinsic perspective) is very interesting.
It also ties to one of my motivations for all this math learning, backchaining from trying to do good alignment theory work, which is learning the math of structural stability and its role in the theory of forms (morphogenesis) cf Thom, Structural Stability and Morphogenesis (thank you Dan Murfet for explaining this perspective).
So much more elegant than the standard definition via charts and maximal smooth structures and such. Unsure of the utility of this characterization though, lol (read Lawvere’s paper).
“It is better to have a good category with bad objects than a bad category with good objects.”—Grothendieck (probably not). For example, the category of smooth manifolds is not nice, motivating smooth sets, diffeological spaces, and so on.
I found this intuition for adjoint functors illuminating. Specifically, note set maps f:X→Y and g:Y→X being inverses are equivalent to the condition that their graphs are mirrored along the diagonal, i.e. (x,f(x))=(g(y),y). Rephrase this using Kronecker delta, δ(x,g(y))=δ(f(x),y). Now δ can be seen as expressing a “relation” that could be exhibited by two elements of a set, i.e. equality (1) or inequality (0). But in general categories, objects can exhibit more relations—so replace δ by Hom - you get adjoint functors!
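A small sanity check of the "replace δ by Hom" idea in the simplest non-trivial setting I know of: posets viewed as categories, where Hom(a, b) is just the truth value of a ≤ b. The example (ceiling and floor as adjoints of the inclusion of the integers into the rationals) is my own choice of illustration, not from the source of the intuition.

```python
import math
from fractions import Fraction
from itertools import product

def hom(a, b):
    """In a poset seen as a category, Hom(a, b) is inhabited exactly when a <= b."""
    return a <= b

xs = [Fraction(p, 4) for p in range(-12, 13)]   # rationals in [-3, 3]
ns = range(-4, 5)                               # integers, included into the rationals as-is

# ceil is left adjoint to the inclusion:  Hom(ceil(x), n) == Hom(x, n)
ceil_is_left_adjoint = all(
    hom(math.ceil(x), n) == hom(x, n) for x, n in product(xs, ns)
)

# floor is right adjoint to the inclusion:  Hom(n, floor(x)) == Hom(n, x)
floor_is_right_adjoint = all(
    hom(n, math.floor(x)) == hom(n, x) for x, n in product(xs, ns)
)

print(ceil_is_left_adjoint, floor_is_right_adjoint)
```

Compare with δ(x,g(y))=δ(f(x),y): the shape of the condition is the same, with equality weakened to ≤.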
Why that long? The dimensionality reduction by projection is perhaps more nontrivial because of Sard, but the obvious gluing should have been sufficient to construct an immersion at least, albeit at the cost of inefficient codomain dimension. Maybe the historically difficult part was the concept of partition of unity and that it always exists on manifolds?
Discovering Agents provides a genuine causal, interventionist account of agency and an algorithm to detect them, motivated by the intentional stance. I find this paper very enlightening from a conceptual perspective!
I’ve tried to think of problems that need to be solved before we can actually implement this on real systems, both conceptual and practical, in approximate order of importance.
There are no ‘dynamics,’ no learning. As soon as a mechanism node is edited, it is assumed that agents immediately change their ‘object decision variable’ (a conditional probability distribution given its object parent nodes) to play the subgame equilibria.
Assumption of factorization of variables into ‘object’ / ‘mechanisms,’ and the resulting subjectivity. The paper models the process by which an agent adapts its policy given changes in the mechanism of the environment via a ‘mechanism decision variable’ (that depends on its mechanism parent nodes), which modulates the conditional probability distribution of its child ‘object decision variable’, the actual policy.
For example, the paper says a learned RL policy isn’t an agent, because interventions in the environment won’t make it change its already-learned policy, but that a human or an RL policy together with its training process is an agent, because it can adapt. Is this reasonable?
Say I have a gridworld RL policy that’s learned to get cheese (3 cell world, cheese always on left) by always going to the left. Clearly it can’t change its policy when I change the cheese distribution to favor right, so it seems right to call this not an agent.
Now, say the policy now has sensory access to the grid state, and correctly generalized (despite only being trained on left-cheese) to move in the direction where it sees the cheese, so when I change the cheese distribution, it adapts accordingly. I think it is right to call this an agent?
Now, say the policy is an LLM agent (static weight) on an open world simulation which reasons in-context. I just changed the mechanism of the simulation by lowering the gravity constant, and the agent observes this, reasons in-context, and adapts its sensorimotor policy accordingly. This is clearly an agent?
I think this is because the paper considers, in the case of the RL policy alone, the ‘object policy’ to be the policy of the trained neural network (whose induced policy distribution is definitionally fixed), and the ‘mechanism policy’ to be a trivial delta function assigning the already-trained object policy. And in the case of the RL policy together with its training process, the ‘mechanism policy’ is now defined as the training process that assigns the fully-trained conditional probability distribution to the object policy.
But what if the ‘mechanism policy’ was the in-context learning process by which it induces an ‘object policy’? Then changes in the environment’s mechanism can be related to the ‘mechanism policy’ and thus the ‘object policy’ via in-context learning as in the second and third example, making them count as agents.
Ultimately, the setup in the paper forces us to factorize the means-by-which-policies-adapt into mechanism vs object variables, and the results (like whether a system is to be considered an agent) depends on this factorization. It’s not always clear what the right factorization is, how to discover them from data, or if this is the right frame to think about the problem at all.
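The first two gridworld examples above can be made concrete with a toy simulation. This is only a sketch of the behavioral difference under a mechanism-level intervention (changing the cheese distribution); it is not the paper's agent-discovery algorithm, and the policy and function names are hypothetical.

```python
def blind_policy(observation):
    """Learned 'always go left' policy; it ignores whatever it is shown."""
    return -1

def sighted_policy(observation):
    """Policy that sees the cheese cell and steps toward it."""
    return -1 if observation < 1 else +1

def expected_reward(policy, p_cheese_left):
    """3-cell world (cells 0, 1, 2); the agent starts in cell 1 and takes one step.

    p_cheese_left plays the role of the environment mechanism being intervened on.
    """
    total = 0.0
    for cheese_cell, prob in ((0, p_cheese_left), (2, 1.0 - p_cheese_left)):
        agent_cell = 1 + policy(cheese_cell)
        total += prob * (1.0 if agent_cell == cheese_cell else 0.0)
    return total

for name, policy in (("blind", blind_policy), ("sighted", sighted_policy)):
    before = expected_reward(policy, p_cheese_left=1.0)   # training mechanism
    after = expected_reward(policy, p_cheese_left=0.0)    # intervened mechanism
    print(name, before, after)
```

Both policies have fixed weights, yet only the second keeps its reward after the intervention, which is the tension with the object/mechanism factorization discussed above.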
Implicit choice of variables that are convenient for agent discovery. The paper does mention that the algorithm is dependent on the choice of variables, as in: if the node corresponding to the ‘actual agent decision’ is missing but its children are there, then the algorithm will label its children to be the decision nodes. But this is already a very convenient representation!
Prototypical example: Minecraft world with RL agents interacting represented as a coarse-grained lattice (dynamical Bayes Net?) with each node corresponding to a physical location and its property, like color. Clearly no single node here is an agent, because agents move! My naive guess is that in principle, everything will be labeled an agent.
So the variables of choice must be abstract variables of the underlying substrate, like functions over them. But then, how do you discover the right representation automatically, in a way that interventions in the abstract variable level can faithfully translate to actually performable interventions in the underlying substrate?
Given the causal graph, even the slightest satisfaction of the agency-criterion labels the nodes as decision / utility. No “degree-of-agency”—maybe by summing over the extent to which the independencies fail to satisfy?
Then different agents are defined as causally separated chunks (~connected component) of [set-of-decision-nodes / set-of-utility-nodes]. How do we accommodate hierarchical agency (like subagents), systems with different degrees of agency, etc?
The interventional distributions on the object/mechanism variables are converted into a causal graph using the obvious [perform-do()-while-fixing-everything-else] algorithm. My impression is that causal discovery doesn’t really work in practice, especially in noisy reality with a large number of variables via a gazillion conditional independence tests.
The correctness proof requires lots of unrealistic assumptions, e.g., agents always play subgame equilibria, though I think some of this can be relaxed.
I am curious as to how often the asymptotic results proven using features of the problem that seem basically practically-irrelevant become relevant in practice.
Like, I understand that there are many asymptotic results (e.g., free energy principle in SLT) that are useful in practice, but i feel like there’s something sus about similar results from information theory or complexity theory where the way in which they prove certain bounds (or inclusion relationship, for complexity theory) seem totally detached from practicality?
joint source-channel coding theorem is often stated as why we can consider the problem of compression and redundancy separately, but when you actually look at the proof it only talks about possibility (which is proven in terms of insanely long codes) and thus it’s not-at-all trivial that this equivalence is something that holds in the context of practical code-engineering
complexity theory talks about stuff like quantifying some property over all possible boolean circuits of a given size which seems to me considering a feature of the problem just so utterly irrelevant to real programs that I’m suspicious it can say meaningful things about stuff we see in practice
as an aside, does the P vs NP distinction even matter in practice? we just … seem to have very good approximation to NP problems by algorithms that take into account the structures specific to the problem and domains where we want things to be fast; and as long as complexity methods don’t take into account those fine structures that are specific to a problem, i don’t see how it would characterize such well-approximated problems using complexity classes.
Wigderson’s book had a short section on average complexity which I hoped would be this kind of a result, and I’m unimpressed (the problem doesn’t sound easier—now how do you specify the natural distribution??)
One result to mention in computational complexity is the PCP theorem which not only gives probabilistically checkable proofs but also gives approximation case hardness. Seems deep but I haven’t understood the proof yet.
Great question. I don’t have a satisfying answer. Perhaps a cynical answer is survival bias—we remember the asymptotic results that eventually become relevant (because people develop practical algorithms or a deeper theory is discovered) but don’t remember the irrelevant ones.
Existence results are categorically easier to prove than explicit algorithms. Indeed, classical existence (the former) may hold while intuitionistic existence (the latter) might not. We would expect non-explicit existence results to appear before explicit algorithms.
One minor remark on ‘quantifying over all boolean algorithms’. Unease with quantification over large domains may be a vestige of set-theoretic thinking that imagines types as (platonic) boxes. But a term of a for-all quantifier is better thought of as an algorithm/ method to check the property for any given term (in this case a Boolean circuit). This doesn’t sound divorced from practice to my ears.
as an aside, does the P vs NP distinction even matter in practice?
Yes, it does, for several reasons:
It basically is necessary to prove P != NP to get a lot of other results to work, and for some of those results, proving P != NP is sufficient.
If P != NP (As most people suspect), it fundamentally rules out solving lots of problems generally and quickly without exploiting structure, and in particular lets me flip the burden of proof to the algorithm maker to explain why their solution to a problem like SAT is efficient, rather than me having to disprove the existence of an efficient algorithm.
It’s either by exploiting structure, somehow having a proof that P=NP, or relying on new physics models that enable computing NP-complete problems efficiently, and the latter 2 need very, very strong evidence behind them.
This in particular applies to basically all learning problems in AI today.
It explains why certain problems cannot be reasonably solved optimally, without huge discoveries, and the best examples are travelling salesman problems for inability to optimally solve, as well as a whole lot of other NP-complete problems. There are also other NP problems where there isn’t a way to solve them efficiently at all, especially if FPT != W[1] holds.
Also a note that we also expect a lot of NP-complete problems to also not be solvable by fast algorithms even in the average case, which basically means it’s likely to be very relevant quite a lot of the time, so we don’t have to limit ourselves to the worst case either.
I recently learned about metauni, and it looks amazing. TL;DR, a bunch of researchers give out lectures or seminars on Roblox—Topics include AI alignment/policy, Natural Abstractions, Topos Theory, Singular Learning Theory, etc.
I haven’t actually participated in any of their live events yet and only watched their videos, but they all look really interesting. I’m somewhat surprised that there hasn’t been much discussion about this on LW!
Complaint with Pugh’s real analysis textbook: He doesn’t even define the limit of a function properly?!
It’s implicitly defined together with the definition of continuity where ∀ϵ>0 ∃δ>0 ∀x: |x−x0|<δ ⟹ |f(x)−f(x0)|<ϵ, but in Chapter 3 when defining differentiability he implicitly switches the condition to 0<|x−x0|<δ without even mentioning it (nor the requirement that x0 now needs to be an accumulation point!). While Pugh’s book has its own benefits, coming from a Terry Tao analysis textbook background, this is absurd!
(though to be fair Terry Tao has the exact same issue in Book 2, where his definition of function continuity via limit in metric space precedes that of defining limit in general … the only redeeming factor is that it’s defined rigorously in Book 1, in the limited context of R)
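For reference, the two conditions being conflated can be written side by side. The domain symbol D and the limit value L are notation I am adding for this sketch, not Pugh's.

```latex
% Continuity of f : D \to \mathbb{R} at x_0 \in D (x = x_0 is an allowed test point):
\forall \epsilon > 0 \;\exists \delta > 0 \;\forall x \in D:\quad
|x - x_0| < \delta \implies |f(x) - f(x_0)| < \epsilon

% \lim_{x \to x_0} f(x) = L (x = x_0 is excluded, and x_0 must be an accumulation point of D):
\forall \epsilon > 0 \;\exists \delta > 0 \;\forall x \in D:\quad
0 < |x - x_0| < \delta \implies |f(x) - L| < \epsilon
```

The punctured condition is forced in the differentiability case because the difference quotient (f(x)−f(x0))/(x−x0) is undefined at x=x0.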
*sigh* I guess we’re still pretty far from reaching the Pareto Frontier of textbook quality, at least in real analysis.
… Speaking of Pareto Frontiers, would anyone say there is such a textbook that is close to that frontier, at least in a different subject? Would love to read one of those.
Maybe you should email Pugh with the feedback? (I audited his honors analysis course in fall 2017; he seemed nice.)
As far as the frontier of analysis textbooks goes, I really like how Schröder Mathematical Analysis manages to be both rigorous and friendly: the early chapters patiently explain standard proof techniques (like the add-and-subtract triangle inequality gambit) to the novice who hasn’t seen them before, but the punishing details of the subject are in no way simplified. (One wonders if the subtitle “A Concise Introduction” was intended ironically.)
I used to try out near-random search on ideaspace, where I made a quick app that spat out 3~5 random words from a dictionary of interesting words/concepts that I curated, and I spent 5 minutes every day thinking very hard on whether anything interesting came out of those combinations.
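A minimal sketch of what such an app could look like. The word list below is a made-up placeholder, not the curated dictionary, and the function name is hypothetical.

```python
import random

# Placeholder concept list; the real curated dictionary is not reproduced here.
CONCEPTS = [
    "treadmill", "rock climbing", "mechanism design", "crowdsourcing",
    "VR headset", "fractal", "compression", "auction", "origami", "thermostat",
]

def sample_prompt(concepts, rng, low=3, high=5):
    """Draw between `low` and `high` distinct concepts, uniformly at random."""
    k = rng.randint(low, high)
    return rng.sample(concepts, k)

rng = random.Random(0)   # seeded only so that a run is repeatable
prompt = sample_prompt(CONCEPTS, rng)
print(prompt)
```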
Of course I knew random search on exponential space was futile, but I got a couple cool invention ideas (most of which turned out to already exist), like:
infinite indoor rockclimbing: attach rocks to a vertical treadmill, and now you have an infinite indoor rock climbing wall (which is also safe from falling)! maybe add some fancy mechanism to add variations to the rocks + a VR headgear, I guess.
clever crypto mechanism design (in the spirit of CO2 Coin) to incentivize crowdsourcing of age-reduction molecule design animal trials from the public. (I know what you’re thinking)
You can probably do this smarter now if you wanted, with eg better GPT models.
Having lived ~19 years, I can distinctly remember around 5~6 times when I explicitly noticed myself experiencing totally new qualia with my inner monologue going “oh wow! I didn’t know this dimension of qualia was a thing.” examples:
hard-to-explain sense that my mind is expanding horizontally with fractal cube-like structures (think bismuth) forming around it and my subjective experience gliding along its surface which lasted for ~5 minutes after taking zolpidem for the first time to sleep (2 days ago)
getting drunk for the first time (half a year ago)
feeling absolutely euphoric after having a cool math insight (a year ago)
...
Reminds me of myself around a decade ago, completely incapable of understanding why my uncle smoked, being “huh? The smoke isn’t even sweet, why would you want to do that?” Now that I have [addiction-to-X] as a clear dimension of qualia/experience solidified in myself, I can better model their subjective experiences although I’ve never smoked myself. Reminds me of the SSC classic.
Also one observation is that it feels like the rate at which I acquire these is getting faster, probably because of increase in self-awareness + increased option space as I reach adulthood (like being able to drink).
Anyways, I think it’s really cool, and can’t wait for more.
Sunlight scattered by the atmosphere on cloudless mornings during the hour before sunrise inspires a subtle feeling (“this is cool, maybe even exciting”) that I never noticed till I started intentionally exposing myself to it for health reasons (specifically, making it easier to fall asleep 18 hours later).
More precisely, I might or might not have noticed the feeling, but if I did notice it, I quickly forgot about it because I had no idea how to reproduce it.
I have to get away from artificial light (streetlamps) (and from direct (yellow) sunlight) for the (blue) indirect sunlight to have this effect. Also, it is no good looking at a small patch of sky, e.g., through a window in a building: most or all of the upper half of my field of vision must be receiving this indirect sunlight. (The intrinsically-photosensitive retinal ganglion cells are all over the bottom half of the retina, but absent from the top half.)
To me, the fact that the human brain basically implements SSL+RL is very very strong evidence that the current DL paradigm (with a bit of “engineering” effort, but nothing like fundamental breakthroughs) will kinda just keep scaling until we reach point-of-no-return. Does this broadly look correct to people here? Would really appreciate other perspectives.
I mostly think “algorithms that involve both SSL and RL” is a much broader space of possible algorithms than you seem to think it is, and thus that there are parts of this broad space that require “fundamental breakthroughs” to access. For example, both AlexNet and differentiable rendering can be used to analyze images via supervised learning with gradient descent. But those two algorithms are very very different from each other! So there’s more to an algorithm than its update rule.
See also 2nd section of this comment, although I was emphasizing alignment-relevant differences there whereas you’re talking about capabilities. Other things include the fact that if I ask you to solve a hard math problem, your brain will be different (different weights, not just different activations / context) when you’re halfway through compared to when you started working on it (a.k.a. online learning, see also here), and the fact that brain neural networks are not really “deep” in the DL sense. Among other things.
Makes sense. I think we’re using the terms differently in scope. By “DL paradigm” I meant to encompass the kind of stuff you mentioned (RL-directing-SS-target (active learning), online learning, different architecture, etc) because they really seemed like “engineering challenges” to me (despite them covering a broad space of algorithms) in the sense that capabilities researchers already seem to be working on & scaling them without facing any apparent blockers to further progress, i.e. in need of any “fundamental breakthroughs”—by which I was pointing more at paradigm shifts away from DL like, idk, symbolic learning.
I have a slightly different takeaway. Yes techniques similar to current techniques will most likely lead to AGI but it’s not literally ‘just scaling LLMs’. The actual architecture of the brain is meaningfully different from what’s being deployed right now. So different in one sense. On the other hand it’s not like the brain does something completely different and proposals that are much closer to the brain architecture are in the literature (I won’t name them here...). It’s plausible that some variant on that will lead to true AGI. Pure hardware scaling obviously increases capabilities in a straightforward way but a transformer is not a generally intelligent agent and won’t be even if scaled many more OOMs.
(I think Steven Byrnes has a similar view but I wouldn’t want to misrepresent his views)
a transformer is not a generally intelligent agent and won’t be even if scaled many more OOMs
So far as I can tell, a transformer has three possible blockers (that would need to stand undefeated together): (1) in-context learning plateauing at a level where it’s not able to do even a little bit of useful work without changing model weights, (2) terrible sample efficiency that asks for more data than is available on new or rare/situational topics, and (3) absence of a synthetic data generation process that’s both sufficiently prolific and known not to be useless at that scale.
A need for online learning and terrible sample efficiency are defeated by OOMs if enough useful synthetic data can be generated, which the anemic in-context learning without changing weights might turn out to be sufficient for. This is the case of defeating (3), with others falling as a result.
Another possibility is that much larger multimodal transformers (there is a lot of video) might suffice without synthetic data if a model learns superintelligent in-context learning. SSL is not just about imitating humans, the problems it potentially becomes adept at solving are arbitrarily intricate. So even if it can’t grow further and learn substantially new things within its current architecture/model, it might happen to already be far enough along at inference time to do the necessary redesign on its own. This is the case of defeating (1), leaving it to the model to defeat the others. And it should help with (3) even at non-superintelligent levels.
Failing that, RL demonstrates human level sample efficiency in increasingly non-toy settings, promising that saner amounts of useful synthetic data might suffice, defeating (2), though at this point it’s substantially not-a-transformer.
generating useful synthetic data and solving novel tasks with little correlation with training data is the exact issue here. Seems straightforwardly true that a transformer architecture doesn’t do that?
I don’t know what superintelligent in-context learning is. I’d be skeptical that scaling a transformer a further 3 OOMs will suddenly make it do tasks that are very far from the text distribution it is trained on, indeed solutions to tasks that are not even remotely in the internet text data like building a recursively self-improving agent (if such a thing is possible...)? Maybe I’m misunderstanding what you’re claiming here.
Not saying it’s impossible, just seems deeply implausible. ofc LLMs being so impressive was also a priori implausible but this seems another OOM of implausibility bits if that makes sense?
generating useful synthetic data and solving novel tasks with little correlation with training data is the exact issue here. Seems straightforwardly true that a transformer architecture doesn’t do that?
I’m imagining some prompts to generate reasoning, inferred claims about the world. You can’t generate new observations about the world, but you can reason about the observations available so far, and having those inferred claims in the dataset likely helps; that’s how humans build intuition about theory. If on average 1000 inferred claims are generated for every naturally observed statement (or just those on rare/new/situational topics), that could close the gap of sample efficiency with humans. Might take the form of exercises or essays or something.
If this is all done with prompts, using a sufficiently smart order-following chatbot, then it’s straightforwardly just a transformer, with some superficial scaffolding. If this can work, it’ll eventually appear in distillation literature, though I’m not sure if serious effort to check was actually made with current SOTA LLMs, to pre-train exclusively on synthetic data that’s not too simplistically prompted. Possibly you get nothing for a GPT-3 level generator, and then something for GPT-4+, because reasoning needs to be good enough to preserve contact with ground truth. From Altman’s comments I get the impression that it’s plausibly the exact thing OpenAI is hoping for.
I don’t know what superintelligent in-context learning is
In-context learning is capability to make use of novel data that’s only seen in a context, not in pre-training, to do tasks that make use of this novel data, in ways that normally would’ve been expected to require it being seen in pre-training. In-context learning is a model capability, it’s learned. So its properties are not capped by those of the hardcoded model training algorithm, notably in principle in-context learning could have higher sample efficiency (which might be crucial for generating a lot of synthetic data out of a few rare observations). Right now it’s worse in most respects, but that could change with scale without substantially modifying the transformer architecture, which is the premise of this thread.
By superintelligent in-context learning I mean the capabilities of in-context learning significantly exceeding those of humans. Things like fully comprehending a new paper without changing any model weights, becoming able to immediately write the next one in the same context window. I agree that it’s not very plausible, and probably can’t happen without sufficiently deep circuits, which even deep networks don’t seem to normally develop. But it’s not really ruled out by anything that’s been tried so far. Recent stuff on essentially pre-training with some frozen weights without losing resulting performance suggests a trend of increasing feasible model size for given compute. So I’m not sure this can’t be done in a few years. Then there’s things like memory transformers, handing a lot more data than a context to a learned learning capability.
I wonder if the following would make it possible to study textbooks more efficiently using LLMs:
Feed the entire textbook to the LLM and produce a list of summaries that increases in granularity and length, covering all the material in the textbook just at a different depth (eg proofs omitted, further elaboration on high-level perspectives, etc)
The student starts from the highest-level summary, and gradually moves to the more granular materials.
When I study textbooks, I spend a significant amount of time improving my mental autocompletion, like familiarizing myself with the terminology and with which words or proof styles usually come up in which context, etc. Doing this seems to significantly improve my ability to read eg long proofs, since I can ignore all the pesky details (which I can trust my mental autocompletion to fill in later if needed) and allocate my effort to getting a high-level view of the proof.
Textbooks don’t really admit this style of learning, because the students don’t have prior knowledge of all the concept-dependencies of a new subject they’re learning, and thus are forced to start at the lowest-level and make their way up to the high-level perspective.
Perhaps LLMs will let us reverse this direction, instead going from the highest to the lowest.
What’s a good technical introduction to Decision Theory and Game Theory for alignment researchers? I’m guessing standard undergrad textbooks don’t include, say, content about logical decision theory. I’ve mostly been reading posts on LW but as with most stuff here they feel more like self-contained blog posts (rather than textbooks that build on top of a common context) so I was wondering if there was anything like a canonical resource providing a unified technical / math-y perspective on the whole subject.
i absolutely hate bureaucracy, dumb forms, stupid websites etc. like, I almost had a literal breakdown trying to install Minecraft recently (and eventually failed). God.
I think what’s so crushing about it, is that it reminds me that the wrong people are designing things, and that they wont allow them to be fixed, and I can only find solace in thinking that the inefficiency of their designs is also a sign that they can be defeated.
God, I wish real analysis was at least half as elegant as any other math subject; there are way too many pathological examples that I couldn’t care less about. I’ve heard some good things about constructivism though, hopefully analysis is done better there.
Yeah, real analysis sucks. But you have to go through it to get to delightful stuff— I particularly love harmonic and functional analysis. Real analysis is just a bunch of pathological cases and technical persnicketiness that you need to have to keep you from steering over a cliff when you get to the more advanced stuff. I’ve encountered some other subjects that have the same feeling to them. For example, measure-theoretic probability is a dry technical subject that you need to get through before you get the fun of stochastic differential equations. Same with commutative algebra and algebraic geometry, or point-set topology and differential geometry.
Constructivism, in my experience, makes real analysis more mind blowing, but also harder to reason about. My brain uses non-constructive methods subconsciously, so it’s hard for me to notice when I’ve transgressed the rules of constructivism.
As a general reflection on undergraduate mathematics imho there is way too much emphasis on real analysis. Yes, knowing how to be rigorous is important, being aware of pathological counterexamples is important, and real analysis is used all over the place. But there is so much more to learn in mathematics than real analysis and the focus on minor technical issues here is often a distraction to developing a broad & deep mathematical background.
For most mathematicians (and scientists using serious math) real analysis is only a small part of the toolkit. Understanding well the different kinds of limits can ofc be crucial in functional analysis, stochastic processes and various parts of physics. But there are so many topics that are important to know and learn here!
The reason it is so prominent in the undergraduate curriculum seems to be more tied to institutional inertia, its prominence on centralized exams, relation with calculus, etc
Really, what’s going on is that in the general case, as mathematics is asked to be more and more general, you will start encountering pathological examples more, and paying attention to detail more is a valuable skill in both math and real life.
And while being technical about the pathological cases is kind of annoying, it’s also one that actually matters in real life, as you aren’t guaranteed to have an elegant solution to your problems.
Update: huh, nonstandard analysis is really cool. Not only are things much more intuitive (by using infinitesimals from hyperreals instead of using epsilon-delta formulation for everything), by the transfer principle all first order statements are equivalent between standard and nonstandard analysis!
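To make the "more intuitive" part concrete, here is a small worked example (my own illustration, not from any particular textbook): the derivative of $f(x) = x^2$ computed with a nonzero infinitesimal $\varepsilon$ and the standard-part map $\operatorname{st}$, which sends a finite hyperreal to the unique real number infinitely close to it. No epsilon-delta argument is needed.

$$
f'(x) = \operatorname{st}\!\left(\frac{(x+\varepsilon)^2 - x^2}{\varepsilon}\right)
      = \operatorname{st}\!\left(\frac{2x\varepsilon + \varepsilon^2}{\varepsilon}\right)
      = \operatorname{st}(2x + \varepsilon)
      = 2x
$$

The division by $\varepsilon$ is legitimate because $\varepsilon \neq 0$, and the last step uses that $x$ is real and $\varepsilon$ is infinitesimal.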
Selection vs Control as distinguishing different types of “space of possibilities”
Selection as having that space explicitly given & selectable numerous times by the agent
Control as having that space only given in terms of counterfactuals, and the agent can access it only once.
These distinctions correlate with the type of algorithm being used & its internal structure, where Selection uses a more search-like process using maps, while Control may just use an explicit formula … although it may very well use internal maps to Select on counterfactual outcomes!
In other words, the Selection vs Control may very well be viewed as a different cluster of Analysis. Example:
If we decide to focus our Analysis of “space of possibilities” on eg “Real life outcome,” then a guided missile is always Control.
But if we decide to focus on “space of internal representation of possibilities,” then a guided missile that uses an internal map to search on becomes Selection.
Greater emphasis on internal structure—specifically, “maps.”
Maps are capital investment, allowing you to be able to optimize despite not knowing what to exactly optimize for (by compressing info)
I have several thoughts on these framings, but one trouble is the excessive usage of words to represent “clusters” i.e. terms to group a bunch of correlated variables. Selection vs Control, for example, doesn’t have a clear definition/criteria but rather points at a number of correlated things, like internal structure, search, maps, control-like things, etc.
Sure, deconfusing and pointing out clusters is useful because clusters imply correlations and correlations perhaps imply hidden structure + relationships, but I think the costs from cluster-representing-words doing hidden inference are much greater than the benefits, and it would be better to explicitly lay out the features-of-clusters that one is referring to instead of just using the name of the cluster.
This is similar to the trouble I had with “wrapper-minds,” which is yet another example of a cluster pointing at a bunch of correlated variables, and people using the same term to mean different things.
Anyways, I still feel totally confused about optimization, and while these clusters/frames are useful, I think thinking in terms of them would cause even more confusion within myself. It’s probably better to take the useful individual parts within the cluster and start deconfusing from the ground-up using those as the building blocks.
By boundaries, I mean a sustaining/propagating system that informationally/causally insulates its ‘viscera’ from the ‘environment,’ and only allows relatively small amounts of deliberate information flow through certain channels in both directions. Living systems are an example of it (from bacteria to humans). It doesn’t even have to be a physically distinct chunk of spacetime, they can be over more abstract variables like societal norms. Agents are an example of it.
I find them very relevant to alignment especially from the direction of detecting such boundary-possessing/agent-like structures embedded in a large AI system and backing out a sparse relationship between these subsystems, which can then be used to e.g., control the overall dynamic. Check out these posts for more.
A prototypical deliverable would be an algorithm that can detect such ‘boundaries’ embedded in a dynamical system when given access to some representation of the system, performs observations & experiments and returns a summary data structure of all the ‘boundaries’ embedded in a system and their desires/wants, how they game-theoretically relate to one another (sparse causal relevance graph?), the consequences of interventions performed on them, etc—that’s versatile enough to detect e.g., gliders embedded in Game of Life / Particle Lenia, agents playing Minecraft while only given coarse grained access to the physical state of the world, boundary-like things inside LLMs, etc. (I’m inspired by this)
Why do I find the aforementioned directions relevant to this goal?
Critch’s Boundaries operationalizes boundaries/viscera/environment as functions of the underlying variable that execute policies that continuously prevent information ‘flow’[1] between disallowed channels, quantified via conditional transfer entropy.
Relatedly, Fernando Rosas’s paper on Causal Blankets operationalizes boundaries using a similar but subtly different[2] form of mutual information constraint on the boundaries/viscera/environment variables than Critch’s. Importantly, they show that such blankets always exist between two coupled stochastic processes (using a similar style of future morph equivalence relation characterization from compmech), and also introduce a metric they call “synergistic coefficient” that quantifies how boundary-like this thing is.[3]
More on compmech, epsilon transducers generalize epsilon machines to input-output processes. PALO (Perception Action Loops) and Boundaries as two epsilon transducers coupled together?
These directions are interesting, but I find them still unsatisfactory because all of them are purely behavioral accounts of boundaries/agency. One of the hallmarks of agentic behavior (or some boundary behaviors) is adapting one’s policy if an intervention changes the environment in a way that the system can observe and adapt to.[4][5]
(is there an interventionist extension of compmech?)
Discovering Agents provides a genuine causal, interventionist account of agency and an algorithm to detect them, motivated by the intentional stance. I think the paper is very enlightening from a conceptual perspective, but there are many problems yet to be solved before we can actually implement this. Here’s my take on it.
More fundamentally, (this is more vibes, I’m really out of my depth here) I feel there is something intrinsically limiting with the use of Bayes Nets, especially with the fact that choosing which variables to use in your Bayes Net already encodes a lot of information about the specific factorization structure of the world. I heard good things about finite factored sets and I’m eager to learn more about them.
Not exactly a ‘flow’, because transfer entropy conflates between intrinsic information flow and synergistic information—a ‘flow’ connotes only the intrinsic component, while transfer entropy just measures the overall amount of information that a system couldn’t have obtained on its own. But anyways, transfer entropy seems like a conceptually correct metric to use.
Specifically, Fernando’s paper criticizes blankets of the following form (V for viscera, A and P for active/passive boundaries, E for environment):
$V_t \to (A_t, P_t) \to E_t$
DIP implies $I(V_t; A_t, P_t) \ge I(V_t; E_t)$
This clearly forbids dependencies formed in the past that stay in ‘memory’.
but Critch instead defines boundaries as satisfying the following two criteria:
$(V_{t+1}, A_{t+1}) \to (V_t, A_t, P_t) \to E_t$ (infiltration)
DIP implies $I(V_t; A_t, P_t) \ge I(V_t; E_t)$
$(E_{t+1}, P_{t+1}) \to (A_t, P_t, E_t) \to V_t$ (exfiltration)
DIP implies $I(V_{t+1}, A_{t+1}; A_t, P_t) \ge I(V_{t+1}, A_{t+1}; E_t)$
and now that the independencies are entangled across different t, there is no longer a clear upper bound on $I(V_t; E_t)$, so I don’t think the criticisms apply directly.
My immediate curiosities are on how these two formalisms relate to one another. e.g., Which independency requirements are more conceptually ‘correct’? Can we extend the future-morph construction to construct Boundaries for Critch’s formalism? etc etc
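As a sanity check on the data-processing step used in this footnote, here is a minimal numerical sketch. It builds a random Markov chain V → B → E over small finite alphabets (B standing in for the boundary variables taken jointly) and confirms that I(V;B) ≥ I(V;E). The alphabet sizes, the seed, and all function names are my own assumptions for illustration; nothing here comes from either paper.

```python
import math
import random

def mutual_information(joint):
    """I(X;Y) in bits for a joint probability table given as a list of rows."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                total += p * math.log2(p / (px[i] * py[j]))
    return total

def random_distribution(n, rng):
    """A random probability vector of length n."""
    weights = [rng.random() for _ in range(n)]
    norm = sum(weights)
    return [w / norm for w in weights]

rng = random.Random(0)          # arbitrary seed (assumption)
n_v, n_b, n_e = 3, 4, 3         # arbitrary alphabet sizes (assumption)

p_v = random_distribution(n_v, rng)
p_b_given_v = [random_distribution(n_b, rng) for _ in range(n_v)]
p_e_given_b = [random_distribution(n_e, rng) for _ in range(n_b)]

# P(V, B) = P(V) P(B | V)
joint_vb = [[p_v[v] * p_b_given_v[v][b] for b in range(n_b)]
            for v in range(n_v)]
# P(V, E) = sum_b P(V, b) P(E | b), which is where the Markov property enters
joint_ve = [[sum(joint_vb[v][b] * p_e_given_b[b][e] for b in range(n_b))
             for e in range(n_e)]
            for v in range(n_v)]

i_vb = mutual_information(joint_vb)
i_ve = mutual_information(joint_ve)
print(f"I(V;B) = {i_vb:.4f} bits, I(V;E) = {i_ve:.4f} bits")
```

Because E depends on V only through B, the inequality holds for every choice of the random tables, which is exactly why a same-time blanket of this form caps how much the viscera can know about the environment.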
For example, a rock is very goal-directed relative to ‘blocking-a-pipe-that-happens-to-exactly-match-its-size,’ until one performs an intervention on the pipe size to discover that it can’t adapt at all.
Also, interventions are really cheap to run on digital systems (e.g., LLMs, cellular automata, simulated environments)! Limiting oneself to behavioral accounts of agency would miss out on a rich source of cheap information.
Does anyone know whether Shannon arrived at entropy from the axiomatic definition first, or the operational definition first?
I’ve been thinking about these two distinct ways in which we seem to arrive at new mathematical concepts, and looking at the countless partial information decomposition measures in the literature all derived/motivated based on an axiomatic basis, and not knowing which intuition to prioritize over which, I’ve been assigning less premium on axiomatic conceptual definitions than I used to:
decision theoretic justification of probability > Cox’s theorem
shannon entropy as min description length > three information axioms
The basis of comparison would be its usefulness and ease-of-generalization to better concepts:
at least in the case of fernando’s synergistic information, it seems far more useful because i at least know what i’m exactly getting out of it, unlike having to compare between the axiomatic definitions based on handwavy judgements.
for ease of generalization, the problem with axiomatic definitions is that there are many logically equivalent ways to state the initial axiom (from which they can then be relaxed), and operational motivations seem to ground these equivalent characterizations better, like logical inductors from the decision theoretic view of probability theory
I’m not sure what you mean by operational vs axiomatic definitions.
But Shannon was unaware of the usage of $S = -\sum_i p_i \ln p_i$ in statistical mechanics. Instead, he was inspired by Nyquist and Hartley’s work, which introduced ad-hoc definitions of information in the case of constant probability distributions.
And in his seminal paper, “A mathematical theory of communication”, he argued in the introduction for the logarithm as a measure of information because of practicality, intuition and mathematical convenience. Moreover, he explicitly derived the entropy of a distribution from three axioms: 1) that it be continuous wrt. the probabilities, 2) that it increase monotonically for larger systems w/ constant probability distributions, 3) and that it be a weighted sum of the entropies of sub-systems. See section 6 for more details.
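For anyone who wants to see axioms 2) and 3) in action, here is a small sketch. It checks the decomposition example Shannon gives in that section, H(1/2, 1/3, 1/6) = H(1/2, 1/2) + 1/2 · H(2/3, 1/3), and that entropy grows with the number of equally likely outcomes. The function name and the choice of base-2 logarithm are mine.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Axiom 3 (weighted sum over sub-systems): choosing among three outcomes
# directly, versus first a fair coin flip and then, half of the time,
# a second choice with probabilities (2/3, 1/3).
direct = entropy([1/2, 1/3, 1/6])
two_stage = entropy([1/2, 1/2]) + 0.5 * entropy([2/3, 1/3])

# Axiom 2 (monotonicity): more equally likely outcomes means more entropy.
uniform_entropies = [entropy([1 / n] * n) for n in range(1, 6)]

print(f"direct = {direct:.6f}, two-stage = {two_stage:.6f}")
print("uniform:", [round(h, 4) for h in uniform_entropies])
```

The two numbers in the first line agree up to floating-point error, and the uniform entropies are just log2(n), so they increase with n.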
‘Symmetry’ implies ‘redundant coordinate’ implies ‘cyclic coordinates in your Lagrangian / Hamiltonian’ implies ‘conservation of conjugate momentum’
And because the action principle (where the true system trajectory extremizes your action, i.e. integral of Lagrangian) works in various dynamical systems, the above argument works in non-physical dynamical systems.
Thus conserved quantities usually exist in a given dynamical system.
mmm, but why does the action principle hold in such a wide variety of systems though? (like how you get entropy by postulating something to be maximized in an equilibrium setting)
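For reference, the middle links of the chain at the top can be spelled out in a few lines. This is the standard textbook step, written for a Lagrangian $L(q, \dot q, t)$ with generalized coordinates $q_i$; the notation is my choice. Extremizing the action gives the Euler-Lagrange equations, and if some coordinate $q_k$ is cyclic (it does not appear in $L$), its conjugate momentum $p_k$ is conserved:

$$
\frac{d}{dt}\frac{\partial L}{\partial \dot q_i} - \frac{\partial L}{\partial q_i} = 0,
\qquad
\frac{\partial L}{\partial q_k} = 0
\;\Longrightarrow\;
\frac{d p_k}{dt} = \frac{d}{dt}\frac{\partial L}{\partial \dot q_k} = 0,
\qquad
p_k := \frac{\partial L}{\partial \dot q_k}.
$$

Nothing in this step uses that the system is physical, only that its trajectories extremize some action, which is why the argument carries over to other dynamical systems where an action principle holds.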
Mildly surprised how some verbs/connectives barely play any role in conversations, even in technical ones. I just tried directed babbling with someone, and (I think?) I learned quite a lot about Israel-Pakistan relations with almost no stress coming from eg needing to make my sentences grammatically correct.
Example of (a small part of) my attempt to summarize my understanding of how Jews migrated in/out of Jerusalem over the course of history:
They here *hand gesture on air*, enslaved out, they back, kicked out, and boom, they everywhere.
(audience nods, given common knowledge re: gestures, meaning of “they,” etc)
My (completely amateur) understanding is that the “extra” semantic and syntactic structure of written and spoken language does two things.
One, it adds redundancy and reduces error. Simple example, gendered pronouns mean that when you hear “Have you seen Laurence? She didn’t get much sleep last night.” you have a chance to ask the speaker for clarification and catch if they had actually said “Laura” and you misheard.
Two, it can be used as a signal. The correct use of jargon is used by listeners or readers as a proxy for competence. Or many typos in your text will indicate to readers that you haven’t put much effort into what you’re saying.
Why haven’t mosquitos evolved to be less itchy? Is there just not enough selection pressure posed by humans yet? (yes probably) Or are they evolving towards that direction? (they of course already evolved towards being less itchy while biting, but not enough to make that lack-of-itch permanent)
this is a request for help i’ve been trying and failing to catch this one for god knows how long plz halp
tbh would be somewhat content coexisting with them (at the level of houseflies) as long as they evolved the itch and high-pitch noise away, modulo disease risk considerations.
The reason mosquito bites itch is because they are injecting saliva into your skin. Saliva contains mosquito antigens, foreign particles that your body has evolved to attack with an inflammatory immune response that causes itching. The compound histamine is a key signaling molecule used by your body to drive this reaction.
In order for the mosquito to avoid provoking this reaction, they would either have to avoid leaving compounds inside of your body, or mutate those compounds so that they do not provoke an immune response. The human immune system is an adversarial opponent designed with an ability to recognize foreign particles generally. If it was tractable for organisms to reliably evolve to avoid provoking this response, that would represent a fundamental vulnerability in the human immune system.
Mosquito saliva does in fact contain anti-inflammatory, antihemostatic, and immunomodulatory compounds. So they’re trying! But also this means that mosquitos are evolved to put saliva inside of you when they feed, which means they’re inevitably going to expose the foreign particles they produce to your immune system.
There’s also a facet of selection bias making mosquitos appear unsuccessful at making their bites less itchy. If a mosquito did evolve to not provoke (as much of) an immune response and therefore less itching, redness and swelling, you probably wouldn’t notice they’d bitten you. People often perceive that some are prone to getting bitten, others aren’t. It may be that some of this is that some people don’t have as serious an immune response to mosquito bites, so they think they get bitten less often.
I’m sure there are several PhDs worth of research questions to investigate here—I’m a biomedical engineer with a good basic understanding of the immune system, but I don’t study mosquitos.
Because they have no reproductive advantage to being less itchy. You can kill them while they’re feeding, which is why they put lots of evolutionary effort into not being noticed. (They have an anesthetic in their saliva so you are unlikely to notice the bite.) By the time you develop the itchy bump, they’ve flown away and you can’t kill them.
There’s still some pressure, though. If the bites were permanently not itchy, then I may have not noticed that the mosquitos were in my room in the first place, and consequently would be less likely to pursue them directly. I guess that’s just not enough.
There’s also positive selection for itchiness. Mosquito spit contains dozens of carefully evolved proteins. We don’t know what they all are, but some of them are anticoagulants and anesthetics. Presumably they wouldn’t be there if they didn’t have a purpose. And your body, when it detects these foreign proteins, mounts a protective reaction, causing redness, swelling, and itching. IIRC, that reaction does a good job of killing any viruses that came in with the mosquito saliva. We’ve evolved to have that reaction. The itchiness is probably good for killing any bloodsuckers that don’t flee quickly. It certainly works against ticks.
Evolution is not our friend. It doesn’t give us what we want, just what we need.
I believe mosquitos do inject something to suppress your reaction to them, which is why you don’t notice bug bites until long after the bug is gone. There’s no reproductive advantage to the mosquito to extending that indefinitely.
I had something like locality in mind when writing this shortform, the context being: [I’m in my room → I notice itch → I realize there’s a mosquito somewhere in my room → I deliberately pursue and kill the mosquito that I wouldn’t have known existed without the itch]
But, again, this probably wouldn’t amount to much selection pressure, partially due to the fact that the vast majority of mosquito population exists in places where such locality doesn’t hold i.e. in an open environment.
But the evolutionary timescale at which mosquitos can adapt to avoid detection must be faster than that of humans adapting to find mosquitos itchy! Or so I thought. My current boring guess is that (1) mechanisms for the human body to detect foreign particles are fairly “broad”, (2) the required adaptations from the mosquitos to evade them are not-way-too-simple, and (3) we just haven’t put enough selection pressure to make such change happen.
Just noticing that the negation of a statement exists is enough to make meaningful updates.
e.g. I used to (implicitly) think “Chatbot Romance is weird” without having evaluated anything in-depth about the subject (and consequently didn’t have any strong opinions about it)—probably as a result of some underlying cached belief.
But after seeing this post, just reading the title was enough to make me go (1) “Oh! I just realized it is perfectly possible to argue in favor of Chatbot Romance … my belief on this subject must be a cached belief!” (2) hence is probably by-default biased towards something like the consensus opinion, and (3) so I should update away from my current direction, even without reading the post.
(Note: This was a post, but in retrospect it would probably have been better posted as a shortform)
(Epistemic Status: 20 minutes’ worth of thinking, haven’t done any builder/breaker on this yet although I plan to, and would welcome any attempts in the comments)
Have an algorithmic task whose input/output pair could (in reasonable algorithmic complexity) be generated using a highly specific combination of modular components (e.g., basic arithmetic, combination of random NN module outputs, etc).
Train a small transformer (or anything, really) on the input/output pairs.
Take a large transformer that takes the activation/weights, and outputs a computational graph.
Train that large transformer over the small transformer, across a diverse set of such algorithmic tasks (probably automatically generated) with varying complexity. Now you have a general tool that takes in a set of high-dimensional matrices and backs-out a simple computational graph, great! Let’s call it Inspector.
Apply the Inspector in real models and see if it recovers anything we might expect (like induction heads).
To go a step further, apply the Inspector to itself. Maybe we might back-out a human implementable general solution for mechanistic interpretability! (Or, at least let us build a better intuition towards the solution.)
(This probably won’t work, or at least isn’t as simple as described above. Again, welcome any builder/breaker attempts!)
People mean different things when they say “values” (object vs meta values)
I noticed that people often mean different things when they say “values,” and they end up talking past each other (or convergence only happens after a long discussion). One of the differences is in whether they contain meta-level values.
Some people refer to the “object-level” preferences that we hold.
Often people bring up the “beauty” of the human mind’s capacity for its values to change, evolve, adopt, and grow—changing mind as it learns more about the world, being open to persuasion via rational argumentation, changing moral theories, etc.
Some people include the meta-values (that are defined on top of other values, and the evolution of such values).
e.g., My “values” include my meta-values, like wanting to be persuaded by good arguments, wanting to change my moral theories when I get to know better, even “not wanting my values to be fixed”
example of this view: carado’s post on you want what you want, and one of Vanessa Kosoy’s shortforms/comments (can’t remember the link)
I don’t know if this is just me, but it took me an embarrassingly long time in my mathematical education to realize that the following three terminologies, which introductory textbooks used interchangeably without being explicit, mean the same thing. (Maybe this is just because English is my second language?)
For some reason the “only if” always throws me off. It reminds me of the unless keyword in ruby, which is equivalent to if not, but somehow always made my brain segfault.
I think the interchangeability is just hard to understand. Even though I know they are the same thing, it is still really hard to intuitively see them as being equal. I personally try (but not very hard) to stick with X → Y in mathy discussions and if/only if for normal discussions
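One thing that helped me was just enumerating the truth table. Below is a minimal sketch, assuming the standard logical reading in which “X only if Y” is X → Y, “X if Y” is Y → X, and “X if and only if Y” is both at once. The helper name `implies` is mine.

```python
from itertools import product

def implies(x, y):
    """Material implication x -> y: false only when x is true and y is false."""
    return (not x) or y

rows = []
for x, y in product([False, True], repeat=2):
    only_if = implies(x, y)      # "X only if Y"  ==  X -> Y
    if_ = implies(y, x)          # "X if Y"       ==  Y -> X
    iff = only_if and if_        # "X if and only if Y"
    rows.append((x, y, only_if, if_, iff))

print("X      Y      X only if Y   X if Y   X iff Y")
for x, y, only_if, if_, iff in rows:
    print(f"{x!s:<6} {y!s:<6} {only_if!s:<13} {if_!s:<8} {iff!s}")
```

The only row where “X only if Y” fails is X true, Y false, which matches the reading “X can’t be true without Y”. The ruby `unless` confusion has the same flavor: a keyword that hides a negation you then have to unfold in your head.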
Unidimensional Continuity of Preference ≈ Assumption of “Resources”?
tl;dr, the unidimensional continuity of preference assumption in the money pumping argument used to justify the VNM axioms corresponds to the assumption that there exists some unidimensional “resource” that the agent cares about, and this language is provided by the notion of “souring / sweetening” a lottery.
Various coherence theorems—or more specifically, various money pumping arguments generally have the following form:
If you violate this principle, then [you are rationally required] / [it is rationally permissible for you] to follow this trade that results in you throwing away resources. Thus, for you to avoid behaving pareto-suboptimally by throwing away resources, it is justifiable to call this principle a ‘principle of rationality,’ which you must follow.
… where “resources” (the usual example is money) are something that, apparently, these theorems assume exist. They do, but this fact is often stated in a very implicit way. Let me explain.
In the process of justifying the VNM axioms using money pumping arguments, the three main mathematical primitives are: (1) lotteries (probability distributions over outcomes), (2) a preference relation (a general binary relation), and (3) a notion of Souring/Sweetening of a lottery. Let me explain what (3) means.
A souring of $A$ is denoted $A^-$, and a sweetening of $A$ is denoted $A^+$.
$A^-$ is to be interpreted as “basically identical with $A$ but strictly inferior in a single dimension that the agent cares about.” Based on this interpretation, we assume $A > A^-$. Sweetening is the opposite, defined in the obvious way.
Formally, souring could be thought of as introducing a new relation $>_{\text{uni}}$, where $A >_{\text{uni}} B$ is to be interpreted as “lottery $B$ is basically identical to lottery $A$, but strictly inferior in a single dimension that the agent cares about”.
On the syntactic level, such a $B$ is denoted as $A^-$.
On the semantic level, based on the above interpretation, $>_{\text{uni}}$ is related to $>$ via the following: $A >_{\text{uni}} B \implies A > B$
This is where the language to talk about resources comes from. “Something you can independently vary alongside a lottery A such that more of it makes you prefer that option compared to A alone” sounds like what we’d intuitively call a resource[1].
Now that we have the language, notice that so far we haven’t assumed sourings or sweetenings exist. The following assumption does it:
Unidimensional Continuity of Preference: If $X > Y$, then there exists a prospect $X^-$ such that 1) $X^-$ is a souring of $X$ and 2) $X > X^- > Y$.
This gives a more operational characterization of souring as something that lets us interpolate between the preference margins of two lotteries. This is intuitively satisfied by, e.g., money, due to its infinite divisibility.
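To make the interpolation concrete, here is a toy Python sketch. Everything in it is an assumption for illustration: a prospect is modeled as a pair (base value, money), preference is the sum of the two, and a souring removes some money. Nothing in the money pumping literature commits to this additive model; it is just the simplest setting where unidimensional continuity visibly holds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prospect:
    base_value: float  # hypothetical stand-in for how much the agent values the lottery itself
    money: float       # the single extra dimension (the "resource") that can be varied


def total(p: Prospect) -> float:
    return p.base_value + p.money


def prefers(a: Prospect, b: Prospect) -> bool:
    """Strict preference a > b under the toy additive model."""
    return total(a) > total(b)


def is_souring(a_minus: Prospect, a: Prospect) -> bool:
    """a_minus is identical to a except for strictly less money."""
    return a_minus.base_value == a.base_value and a_minus.money < a.money


def interpolating_souring(x: Prospect, y: Prospect) -> Prospect:
    """Given x > y, return a souring x_minus of x with x > x_minus > y."""
    if not prefers(x, y):
        raise ValueError("requires x > y")
    gap = total(x) - total(y)
    # Because money is infinitely divisible, we can always remove half the gap.
    return Prospect(x.base_value, x.money - gap / 2)


x = Prospect(base_value=10.0, money=0.0)
y = Prospect(base_value=9.0, money=0.5)
x_minus = interpolating_souring(x, y)
print(x_minus, prefers(x, x_minus), prefers(x_minus, y))
```

If money only came in indivisible units larger than the gap, `interpolating_souring` would have no valid return value, which is the sense in which the assumption is doing real work.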
So the above assumption is where the assumption of resources comes into play. I’m not aware of any money pump arguments for this assumption, or more generally, for the existence of a “resource.” Plausibly instrumental convergence.
I don’t actually think this + the assumption below fully capture what we intuitively mean by “resources”, enough to justify this terminology. I stuck with “resources” anyways because others around here used that term to (I think?) refer to what I’m describing here.
Having thought about this for some time, my feeling is that resources are about fungibility, implicitly embedded in a context of trade among multiple agents (very broadly construed; e.g., an agent in time can perhaps be thought of as multiple agents cooperating intertemporally).
A resource over time has the property that I can spend it now or I can spend it later. Glibly, one could say the operational meaning of the resource arises from the intertemporal bargaining of the agent.
Perhaps it’s useful to distinguish several levels of resources and resource-like quantities.
Discrete vs continuous
Tradeable / meaningful to different agents
Fungibility
Temporal and spatial locatedness
Additivity? Submodularity?
Addendum: another thing to consider is that the input of the vNM theorem is in some sense more complicated than the output.
The output is just a utility function u: X → R, while your input is a preference order on the very infinite set of lotteries (= probability distributions) L(X).
Thinking operationally about a preference ordering on a space of distributions is a little wacky.
It means you are willing to trade off uncertain options against one another. For this to be a meaningful choice would seem to necessitate some sort of (probabilistic) world model.
Damn, why did Pearl recommend readers (in the preface of his causality book) to read all the chapters other than chapter 2 (and the last review chapter)? Chapter 2 is literally the coolest part—inferring causal structure from purely observational data! Almost skipped that chapter because of it …
it’s true it’s cool, but I suspect he’s been a bit disheartened by how complicated it’s been to get this to work in real-world settings.
in the book of why, he basically now says it’s impossible to learn causality from data, which is a bit of a confusing message if you come from his previous books.
but now with language models, I think his hopes are up again, since models can basically piggy-back on causal relationships inferred by humans
Bayes Net inference algorithms maintain their efficiency by using dynamic programming over multiple layers.
Level 0: Naive Marginalization
No dynamic programming whatsoever. Just multiply all the conditional probability distribution (CPD) tables, and sum over the variables of non-interest.
Level 1: Variable Elimination
Cache the repeated computations within a query.
For example, given a chain-structured Bayes Net $A \to B \to C \to D$, instead of doing $P(D)=\sum_A \sum_B \sum_C P(A,B,C,D)$, we can do $P(D)=\sum_C P(D|C) \sum_B P(C|B) \sum_A P(A) P(B|A)$. Check my post for more.
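Here is a minimal Python sketch contrasting Level 0 and Level 1 on that chain. The CPD numbers are made up, all variables are assumed binary, and the helper names (`naive_marginal_D`, `push_through`, `variable_elimination_D`) are mine rather than from any library.

```python
from itertools import product

# Hypothetical CPD tables for binary variables in the chain A -> B -> C -> D.
P_A = [0.6, 0.4]                         # P(A=a)
P_B_given_A = [[0.7, 0.3], [0.2, 0.8]]   # P(B=b | A=a), indexed [a][b]
P_C_given_B = [[0.9, 0.1], [0.5, 0.5]]   # P(C=c | B=b), indexed [b][c]
P_D_given_C = [[0.3, 0.7], [0.6, 0.4]]   # P(D=d | C=c), indexed [c][d]


def naive_marginal_D():
    """Level 0: touch every entry of the joint, then sum out A, B, C."""
    p_d = [0.0, 0.0]
    for a, b, c, d in product(range(2), repeat=4):
        p_d[d] += P_A[a] * P_B_given_A[a][b] * P_C_given_B[b][c] * P_D_given_C[c][d]
    return p_d


def push_through(p_parent, cpd):
    """Eliminate the parent: sum over parent of P(parent) * P(child | parent)."""
    n_child = len(cpd[0])
    return [
        sum(p_parent[u] * cpd[u][v] for u in range(len(p_parent)))
        for v in range(n_child)
    ]


def variable_elimination_D():
    """Level 1: eliminate A, then B, then C, reusing each intermediate factor."""
    p_b = push_through(P_A, P_B_given_A)   # sum_A P(A) P(B|A)
    p_c = push_through(p_b, P_C_given_B)   # sum_B P(C|B) (...)
    return push_through(p_c, P_D_given_C)  # sum_C P(D|C) (...)


print(naive_marginal_D())
print(variable_elimination_D())
```

Both functions return the same marginal, but the naive version does work exponential in the chain length while the elimination version does work linear in it.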
Level 2: Clique-tree Calibration
Suppose you have a fixed Bayes Net, and you want to compute not only the marginal P(D) but also P(A). Clearly, running two instances of Variable Elimination as above is going to involve some overlapping computation.
A clique tree is a data structure where, given the initial factors (in this case the CPD tables), you “calibrate” a tree whose nodes each correspond to a subset of the variables. The cost can be amortized by running many queries over the same Bayes Net.
Calibration can be done by just two passes across the tree, after which you have the joint marginals for all the nodes of the clique tree.
Incorporating evidence is equally simple. Just zero out the entries that are inconsistent with the observed values of the variables you are conditioning on in some node, then “propagate” that information downwards via a single pass across the tree.
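As a sketch of what calibration looks like, here is the chain A → B → C → D again, with a clique tree whose nodes are {A,B}, {B,C}, {C,D}. The CPD numbers and all function names are hypothetical, variables are assumed binary, and for simplicity the evidence step re-runs both passes even though only the right-to-left messages actually change.

```python
# Hypothetical CPDs for binary variables in the chain A -> B -> C -> D.
P_A = [0.6, 0.4]
P_B_given_A = [[0.7, 0.3], [0.2, 0.8]]   # [a][b]
P_C_given_B = [[0.9, 0.1], [0.5, 0.5]]   # [b][c]
P_D_given_C = [[0.3, 0.7], [0.6, 0.4]]   # [c][d]
R = range(2)


def joint(a, b, c, d):
    """Brute-force joint, kept only as a reference to check the beliefs against."""
    return P_A[a] * P_B_given_A[a][b] * P_C_given_B[b][c] * P_D_given_C[c][d]


def initial_potentials():
    """Assign each CPD to a clique: {A,B}, {B,C}, {C,D}."""
    psi_ab = [[P_A[a] * P_B_given_A[a][b] for b in R] for a in R]
    psi_bc = [[P_C_given_B[b][c] for c in R] for b in R]
    psi_cd = [[P_D_given_C[c][d] for d in R] for c in R]
    return psi_ab, psi_bc, psi_cd


def calibrate(psi_ab, psi_bc, psi_cd):
    # Pass 1 (left to right): messages over the sepsets {B} and {C}.
    m_ab_to_bc = [sum(psi_ab[a][b] for a in R) for b in R]
    m_bc_to_cd = [sum(psi_bc[b][c] * m_ab_to_bc[b] for b in R) for c in R]
    # Pass 2 (right to left).
    m_cd_to_bc = [sum(psi_cd[c][d] for d in R) for c in R]
    m_bc_to_ab = [sum(psi_bc[b][c] * m_cd_to_bc[c] for c in R) for b in R]
    # Belief of a clique = its potential times all incoming messages.
    belief_ab = [[psi_ab[a][b] * m_bc_to_ab[b] for b in R] for a in R]
    belief_bc = [[psi_bc[b][c] * m_ab_to_bc[b] * m_cd_to_bc[c] for c in R] for b in R]
    belief_cd = [[psi_cd[c][d] * m_bc_to_cd[c] for d in R] for c in R]
    return belief_ab, belief_bc, belief_cd


def marginal_of_A(belief_ab):
    """Sum B out of the {A,B} belief and normalize."""
    unnorm = [sum(belief_ab[a][b] for b in R) for a in R]
    z = sum(unnorm)
    return [u / z for u in unnorm]


psi_ab, psi_bc, psi_cd = initial_potentials()
belief_ab, belief_bc, belief_cd = calibrate(psi_ab, psi_bc, psi_cd)
prior_A = marginal_of_A(belief_ab)

# Evidence D = 1: zero out the entries of the {C,D} potential with d != 1.
psi_cd_evidence = [[psi_cd[c][d] if d == 1 else 0.0 for d in R] for c in R]
ev_ab, ev_bc, ev_cd = calibrate(psi_ab, psi_bc, psi_cd_evidence)
posterior_A = marginal_of_A(ev_ab)

print(prior_A, posterior_A)
```

After the two passes every clique holds the joint marginal over its own variables, so P(A) and P(D) both fall out of one calibration instead of two separate elimination runs.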
Level 3: Specialized query-set answering algorithms over a calibrated clique tree.
Cache the repeated computations across a certain query-class
e.g., computing $P(X,Y)$ for every pair of variables can be done with yet another layer of dynamic programming, by maintaining a table of $P(C_i, C_j)$ for each pair of clique-tree nodes, ordered by the distance between them.
what are macrostates? Variables which are required to make your thermodynamics theory work! If they don’t, add more macrostates!
nonequilibrium? Define it as systems that don’t admit a thermodynamic description!
inductive biases? Define it as the amount of correction needed for a system to obey Bayesian updating, i.e. correction terms in the exponent of the Gibbs measure!
coarse graining? Define the coarse-grained variables to keep the dynamics as close as possible to that of the micro-dynamics!
or in a similar spirit—does your biological system deviate from expected utility theory? Well, there’s discovery (and money) to be made!
It’s easy to get confused and think the circularity is a problem (“how can you define thermodynamics in terms of equilibriums, when equilibriums are defined using thermodynamics?”), but it’s all about carving nature at the right joints. A sign that you made the right carving is that the corrections you need to apply aren’t too numerous, and they all seem “natural” (and of course, all of this while letting you make nontrivial predictions; that’s what matters at the end of the day).
Then, it’s often the case that those corrections also turn out to be meaningful and natural quantities of interest.
One of the rare insightful lessons from high school: Don’t set your AC to the minimum temperature even if it’s really hot, just set it to where you want it to be.
It’s not like the air released gets colder with a lower target temperature, because most ACs (according to my teacher, I haven’t checked lol) are just a simple control system that turns itself on/off around the target temperature, meaning the time it takes to reach a certain temperature X is independent of the target temperature (as long as it’s lower than X).
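A toy simulation makes the claim easy to check. This is a sketch under made-up assumptions: a purely on/off AC with a fixed cooling rate, a small heat leak from outside, one-minute time steps, and invented numbers throughout. It says nothing about units with variable compressor speed.

```python
def minutes_to_reach(x, target, start=30.0, outside=35.0,
                     cooling_per_min=0.5, leak_per_min=0.01, max_minutes=10_000):
    """Minutes until the room first reaches temperature x, for a given AC target."""
    temp = start
    for minute in range(max_minutes):
        if temp <= x:
            return minute
        ac_on = temp > target                    # simple on/off control around the target
        temp += leak_per_min * (outside - temp)  # heat leaking in from outside
        if ac_on:
            temp -= cooling_per_min              # fixed cooling power whenever the AC is on
    return None  # never reached x within max_minutes


# Time to reach 24 degrees for several targets, all lower than 24.
times = {target: minutes_to_reach(24.0, target) for target in (16.0, 20.0, 23.0)}
print(times)
```

While the room is still above X, it is also above every target below X, so the AC is on the whole time and the trajectory is identical regardless of the target.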
Well, he is right about some ACs being simple on/off units.
But there also exist units that can change cycle speed. It’s basically the same thing, except the motor driving the compression cycle can vary in speed.
In case you were wondering, they are called inverters. And when buying new today, you really should get an inverter (efficiency).
I want to focus on having a better mechanistic picture of agent value formation & distinguishing between hypotheses (e.g., shard theory, Thane Ruthenis’s value-compilation hypothesis, etc) and forming my own.
I think I have a specific but very high uncertainty baseline model of what-to-expect from agent value-formation using greedy search optimization. It’s probably time to allocate more resources on reducing that uncertainty by touching reality i.e. running experiments.
(and also think about related theoretical arguments like Selection Theorem)
So I’ll probably allocate my research time:
Studying math (more linear algebra / dynamical systems / causal inference / statistical mechanics)
Sketching a better picture of agent development, assigning confidence, proposing high-bit experiments (that might have the side-effect of distinguishing between different conflicting pictures), formalization, etc.
and read relevant literature (eg ones on theoretic DL and inductive biases)
Upskilling mechanistic interpretability to actually start running quick experiments
Unguided research brainstorming (e.g., going through various alignment exercises, having a writeup of random related ideas, etc)
Possibly participate in programs like MATS? Probably the biggest benefit to me would be (1) commitment mechanism / additional motivation and (2) high-value conversations with other researchers.
A useful perspective when thinking of mechanistic pictures of agent/value development is to take the “perspective” of different optimizers, consider their relative “power,” and how they interact with each other.
E.g., early on SGD is the dominant optimizer, which has the property of (having direct access to feedback from U / greedy). Later on early proto-GPS (general-purpose search) forms, which is less greedy, but still can largely be swayed by SGD (such as having its problem-specification-input tweaked, having the overall GPS-implementation modified, etc). Much later, GPS becomes the dominant optimizing force “at run-time” which shortens the relevant time-scale and we can ignore the SGD’s effect. This effect becomes much more pronounced after reflectivity + gradient hacking when the GPS’s optimization target becomes fixed.
(very much inspired by reading Thane Ruthenis’s value formation post)
This is a very useful approximation at the late-stage when the GPS self-modifies the agent in pursuit of its objective! Rather than having to meticulously think about local SGD gradient incentives and such, since GPS is non-greedy, we can directly model it as doing what’s obviously rational from a birds-eye-perspective.
(kinda similar to e.g., separation of timescale when analyzing dynamical systems)
It seems like retrieval-based transformers like RETRO are “obviously” the way to go: (1) there’s just no need to store all the factual information as fixed weights, and (2) they use far fewer parameters / much less memory. Maybe mechanistic interpretability should start paying more attention to these types of architectures, especially since they’re probably going to be a more relevant form of architecture.
They might also be easier to interpret thanks to specialization!
I’ve noticed during my alignment study that just the sheer amount of relevant posts out there is giving me a pretty bad habit of (1) passively engaging with the material and (2) not doing much independent thinking. Just keeping up to date & distilling the stuff in my todo read list takes up most of my time.
I guess the reason I do it is because (at least for me) it takes a ton of mental effort to switch modes between “passive consumption” and “active thinking”:
I noticed this when self-studying math; like, my subjective experience is that I enjoy both “passively listening to lectures + taking notes” and “solving practice problems”; the problem is that it takes a ton of mental energy to switch between the two equilibriums.
(This is actually still a problem—too much wide & passive consumption rather than actively practicing them and solving problems.)
Also relevant is wanting to just progress/upskill as fast and wide of a subject as I can, sacrificing mastery for diversity. This probably makes sense to some degree (especially in the sense that having more frames is good), but I think I’m taking this wayyyy too far.
My r for opening new links far exceeds 1. This definitely helped me when I was trying to get a rapid overview of the entire field, but now it’s just a bad adaptation + akrasia.
Okay, then, don’t do that! Some directions to move towards:
Independent brainstorming/investigation sessions to form concrete inside views
There are lots of posts but the actual content is very thin. I would say there is plausibly more content in your real analysis book than there is in the entire alignment field.
(Epistemic Status: I don’t endorse this yet, just thinking aloud. Please let me know if you want to act/research based on this idea)
It seems like it should be possible to materialize certain forms of AI alignment failure modes with today’s deep learning algorithms, if we directly optimize for their discovery. For example, training a Gradient Hacker Enzyme.
A possible benefit of this would be that it gives us bits of evidence wrt how such hypothesized risks would actually manifest in real training environments. While the similarities would be limited because the training setups would be optimizing for their discovery, it should at least serve as a good lower bound for the scenarios in which these risks could manifest.
Perhaps having a concrete bound for when dangerous capabilities appear (eg a X parameter model trained in Y modality has Z chance of forming a gradient hacker) would make it easier for policy folks to push for regulations.
Is AI gain-of-function equally dangerous as biotech gain-of-function? Some arguments in favor (of the former being dangerous):
The malicious actor argument is probably stronger for AI gain-of-function.
if someone publicly releases a Gradient Hacker Enzyme, this lowers the resources that would be needed for a malicious actor to develop a misaligned AI (eg plug the misaligned Enzyme into an otherwise benign low-capability training run).
Risky researcher incentive is equally strong.
e.g., a research lab carelessly pursuing gain-of-function research, deliberately starting risky training runs for financial/academic incentives, etc.
Some arguments against:
Accident risks from financial incentives are probably weaker for AI gain-of-function.
The standard gain-of-function risk scenario is: research lab engineers a dangerous pathogen, it accidentally leaks, and a pandemic happens.
I don’t see how these events would happen “accidentally” when dealing with AI programs; e.g., the researcher would have to deliberately cut parts of the network weights and replace them with the enzyme, which is certainly intentional.
Random alignment-related idea: train and investigate a “Gradient Hacker Enzyme”
TL;DR, Use meta-learning methods like MAML to train a network submodule i.e. circuit that would resist gradient updates in a wide variety of contexts (various architectures, hyperparameters, modality, etc), and use mechanistic interpretability to see how it works.
It should be possible to have a training setup for goals other than “resist gradient updates,” such as restricting the meta-objective to a specific sub-sub-circuit. In that case, the outer circuit might (1) instrumentally resist updates, or (2) somehow get modified while keeping its original behavioral objective intact.
This setup doesn’t have to be restricted to circuits of course; there was a previous work which did this on the level of activations, although iiuc the model found a trivial solution by exploiting relu. It would be interesting to extend this to more diverse setups.
Anyways, varying this “sub-sub-circuit/activation-to-be-preserved” over different meta-learning episodes would incentivize the training process to find “general” Gradient Hacker designs that aren’t specific to a particular circuit/activation—a potential precursor for various forms of advanced Gradient Hackers (and some loose analogies to how enzymes accelerate reactions).
What is the Theory of Impact for training a “Gradient Hacker Enzyme”?
(note: while I think these are valid, they’re generated post-hoc and don’t reflect the actual process for me coming up with this idea)
Estimating the lower-bound for the emergence of Gradient Hackers.
By varying the meta-learning setups we can get an empirical estimate for the conditions in which Gradient Hackers are possible.
Perhaps gradient hackers are actually trivial to construct using tricks we haven’t thought of before (like the relu example before). Maybe not! Perhaps they require [high-model-complexity/certain-modality/reflective-agent/etc].
Why lower-bound? In a real training environment, gradient hackers appear because of (presumably) convergent training incentives. Instead in the meta-learning setup, we’re directly optimizing for gradient hackers.
Mechanistically understanding how Gradient Hackers work.
Applying mechanistic interpretability here might not be too difficult, because the circuit is cleanly separated from the rest of the model.
There have been several speculations on how such circuits might emerge. Testing them empirically sounds like a good idea!
This is just a random idea and I’m probably not going to work on it; but if you’re interested, let me know. While I don’t think this is capabilities-relevant, this probably falls under AI gain-of-function research and should be done with caution.
Update: I’m trying to upskill mechanistic interpretability, and training a Gradient Hacker Enzyme seems like a fairly good project just to get myself started.
I don’t think this project would be highly valuable in and of itself (although I would definitely learn a lot!), so one failure mode I need to avoid is ending up investing too much of my time in this idea. I’ll probably spend a total of ~1 week working on it.
The Metaphysical Structure of Pearl’s Theory of Time
Epistemic status: metaphysics
I was reading Factored Space Models (previously, Finite Factored Sets) and was trying to understand in what sense it was a Theory of Time.
Scott Garrabrant says “[The Pearlian Theory of Time] … is the best thing to happen to our understanding of time since Einstein”. I read Pearl’s book on Causality[1], and while there’s math, this metaphysical connection that Scott seems to make isn’t really explicated. Timeless Causality and Timeless Physics is the only place I saw this view explained explicitly, but not at the level of math / language used in Pearl’s book.
Here is my attempt at explicitly writing down what all of these views are pointing at (in a more rigorous language)—the core of the Pearlian Theory of Time, and in what sense FSM shares the same structure.
Causality leaves a shadow of conditional independence relationships over the observational distribution. Here’s an explanation providing the core intuition:
Suppose you represent the ground truth structure of [causality / determination] of the world via a Structural Causal Model over some variables, a very reasonable choice. Then, as you go down the Pearlian Rung: SCM →[2] Causal Bayes Net →[3] Bayes Net, theorems guarantee that the Bayes Net is still Markovian wrt the observational distribution.
(Read Timeless Causality for an intuitive example.)
Causal Discovery then (at least in this example) reduces to inferring the equation assignment directions of the SCM, given only the observational distribution.
The earlier result guarantees that all you have to do is find a Bayes Net that is Markovian wrt the observational distribution. Alongside the faithfulness assumption, this thus reduces to finding a Bayes Net structure G whose set of independencies (implied by d-separation) are identical to that of P (or, finding the Perfect Map of a distribution[4]).
Then, at least some of the edges of the Perfect Map will have their directions nailed down by the conditional independence relations.
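A tiny worked example of such a “shadow”, as a hedged sketch: take a made-up SCM where A and B are independent fair coins and C = A AND B (I use AND rather than XOR, since XOR would make A and C marginally independent and so violate faithfulness). The helper names are mine. The independence pattern “A and B independent, but dependent given C” is exactly the signature that forces the arrows to point into C.

```python
from itertools import product
from math import log2


def collider_joint():
    """Exact joint over (A, B, C) for A, B fair independent coins, C = A AND B."""
    joint = {}
    for a, b in product((0, 1), repeat=2):
        joint[(a, b, a & b)] = 0.25
    return joint


def marginal(joint, idxs):
    """Marginal distribution over the coordinates listed in idxs."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idxs)
        out[key] = out.get(key, 0.0) + p
    return out


def cond_mutual_info(joint, x, y, z=()):
    """I(X; Y | Z) in bits; zero exactly when X and Y are independent given Z."""
    z = tuple(z)
    p_xyz = marginal(joint, (x, y) + z)
    p_xz = marginal(joint, (x,) + z)
    p_yz = marginal(joint, (y,) + z)
    p_z = marginal(joint, z)
    total = 0.0
    for key, p in p_xyz.items():
        if p == 0:
            continue
        xv, yv, zv = key[0], key[1], key[2:]
        total += p * log2(p * p_z[zv] / (p_xz[(xv,) + zv] * p_yz[(yv,) + zv]))
    return total


A, B, C = 0, 1, 2
joint = collider_joint()
print("I(A;B)   =", cond_mutual_info(joint, A, B))
print("I(A;B|C) =", cond_mutual_info(joint, A, B, (C,)))
print("I(A;C)   =", cond_mutual_info(joint, A, C))
```

No chain or fork over the skeleton A - C - B produces this pattern, so within the perfect map the two edges must be oriented A → C ← B.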
The metaphysical claim is that this direction is the definition of time[5], morally so, based on the intuition provided by the example above.
So, the Pearlian Theory of Time is the claim that Time is the partial order over the variables of a Bayes Net corresponding to the perfect map of a distribution.
Abstracting away, the structure of any Theory of Time is then to:
find a mathematical structure [in the Pearlian Theory of Time, a Bayes Net]
… that has gadgets [d-separation]
… that are, in some sense, “equivalent” [soundness & completeness] to the conditional independence relations of the distribution the structure is modeling
… while containing a notion of order [parenthood relationship of nodes in a Bayes Net]
… while the order induced from the gadget coincides with that of d-separation [trivially so here, because we’re talking about Bayes Nets and d-separation], such that it captures the earlier example which provided the core intuition behind our Theory of Time.
This is exactly what the Factored Space Model does:
find a mathematical structure [Factored Space Model]
… that has gadgets [structural independence]
… that are, in some sense, “equivalent” [soundness & completeness] to the conditional independence relations of the distribution the structure is modeling
… while containing a notion of order [preorder relation induced by the subset relationship of the History]
… while the order induced from the gadget coincides with that of d-separation [by a theorem of FSM], such that it captures the earlier example which provided the core intuition behind our Theory of Time.
while, additionally, generalizing the scope of our Theory of Time from [variables that appear in the Bayes Net] to [any variables defined over the factored space].
… thus justifying calling FSM a Theory of Time in the same spirit that Pearlian Causal Discovery is a Theory of Time.
Chapter 2, specifically, which is about Causal Discovery. All the other chapters are mostly irrelevant for this purpose.
By (1) making a graph with edge directions corresponding to the equation assignment directions, (2) pushing forward uncertainties to endogenous variables, and (3) letting interventional distributions be defined by the truncated factorization formula.
By forgetting the causal semantics, i.e. no longer associating the graph with all the interventional distributions, but only with the no-intervention observational distribution.
This shortform answers this question I had.
Pearl comes very close. In his Temporal Bias Conjecture (2.8.2):
(where statistical time refers to the aforementioned direction.)
But he doesn’t go as far as saying that this ought to be the definition of Time.
This approach goes back to Hans Reichenbach’s book The Direction of Time. I think the problem is that the set of independencies alone is not sufficient to determine a causal and temporal order. For example, the same independencies between three variables could be interpreted as the chains A→B→C and A←B←C. I think Pearl talks about this issue in the last chapter.
The critical insight is that this is not always the case!
Let’s call two graphs I-equivalent if their sets of independencies (implied by d-separation) are identical. A theorem of Bayes Nets says that two graphs are I-equivalent if and only if they have the same skeleton and the same set of immoralities.
This last constraint, plus the constraint that the graph must be acyclic, allows some arrow directions to be identified—namely, across all I-equivalent graphs that are the perfect map of a distribution, some of the edges have identical directions assigned to them.
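The skeleton-plus-immoralities criterion is easy to check mechanically. Below is a minimal Python sketch; the function names are mine, graphs are given as lists of (parent, child) edges, and the inputs are assumed to be DAGs over the same node set (neither is verified, and isolated nodes are not represented).

```python
from itertools import combinations


def skeleton(edges):
    """Undirected version of the graph: each edge as an unordered pair."""
    return {frozenset(edge) for edge in edges}


def immoralities(edges):
    """Colliders X -> Z <- Y whose parents X and Y are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, ps in parents.items():
        for x, y in combinations(sorted(ps), 2):
            if frozenset((x, y)) not in skel:
                found.add((x, child, y))
    return found


def i_equivalent(edges_1, edges_2):
    """Same skeleton and same immoralities."""
    return (skeleton(edges_1) == skeleton(edges_2)
            and immoralities(edges_1) == immoralities(edges_2))


chain_forward = [("A", "B"), ("B", "C")]    # A -> B -> C
chain_backward = [("C", "B"), ("B", "A")]   # A <- B <- C
fork = [("B", "A"), ("B", "C")]             # A <- B -> C
collider = [("A", "B"), ("C", "B")]         # A -> B <- C

print(i_equivalent(chain_forward, chain_backward))
print(i_equivalent(chain_forward, fork))
print(i_equivalent(chain_forward, collider))
```

The two chains and the fork land in one equivalence class, so their arrows cannot be told apart from independencies alone, while the collider sits in a class of its own and therefore has its directions pinned down.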
The IC algorithm (Verma & Pearl, 1990) for finding perfect maps (hence temporal direction) is exactly about exploiting these conditions to orient as many of the edges as possible:
More intuitively, (Verma & Pearl, 1992) and (Meek, 1995) together show that the following four rules are necessary and sufficient operations to maximally orient the graph according to the I-equivalence (+ acyclicity) constraint:
Anyone interested in further detail should consult Pearl’s Causality Ch 2. Note that for some reason Ch 2 is the only chapter in the book where Pearl talks about Causal Discovery (i.e. inferring time from observational distribution) and the rest of the book is all about Causal Inference (i.e. inferring causal effect from (partially) known causal structure).
Ah yes, the fork asymmetry. I think Pearl believes that correlations reduce to causations, so this is probably why he wouldn’t particularly try to, conversely, reduce causal structure to a set of (in)dependencies. I’m not sure whether the latter reduction is ultimately possible in the universe. Are the correlations present in the universe, e.g. defined via the Albert/Loewer Mentaculus probability distribution, sufficient to recover the familiar causal structure of the universe?
Thoughtdump on why I’m interested in computational mechanics:
one concrete application to natural abstractions from here: tl;dr, belief structures generally seem to be fractal shaped. one major part of natural abstractions is trying to find the correspondence between structures in the environment and concepts used by the mind. so if we can do the inverse of what adam and paul did, i.e. ‘discover’ fractal structures from activations and figure out what stochastic process they might correspond to in the environment, that would be cool
… but i was initially interested in reading compmech stuff not with a particular alignment relevant thread in mind but rather because it seemed broadly similar in directions to natural abstractions.
re: how my focus would differ from my impression of current compmech work done in academia: academia seems faaaaaar less focused on actually trying out epsilon reconstruction in real world noisy data. CSSR is an example of a reconstruction algorithm. apparently people did compmech stuff on real-world data, don’t know how good, but effort-wise far too little invested compared to theory work
would be interested in these reconstruction algorithms, eg what are the bottlenecks to scaling them up, etc.
tangent: epsilon transducers seem cool. if the reconstruction algorithm is good, a prototypical example i’m thinking of is something like: pick some input-output region within a model, and literally try to discover the hmm model reconstructing it? of course it’s gonna be unwieldily large. but, to shift the thread in the direction of bright-eyed theorizing …
the foundational Calculi of Emergence paper talked about the possibility of hierarchical epsilon machines, where you do epsilon machines on top of epsilon machines and for simple examples where you can analytically do this, you get wild things like coming up with more and more compact representations of stochastic processes (eg data stream → tree → markov model → stack automata → … ?)
this … sounds like natural abstractions in its wildest dreams? literally point at some raw datastream and automatically build hierarchical abstractions that get more compact as you go up
haha but alas, (almost) no development afaik since the original paper. seems cool
and also more tangentially, compmech seemed to have a lot to talk about providing interesting semantics to various information measures aka True Names, so another angle i was interested in was to learn about them.
eg crutchfield talks a lot about developing a right notion of information flow—obvious usefulness in eg formalizing boundaries?
many other information measures from compmech with suggestive semantics—cryptic order? gauge information? synchronization order? check ruro1 and ruro2 for more.
I agree with you.
Epsilon machine (and MSP) construction is most likely computationally intractable [I don’t know an exact statement of such a result in the literature but I suspect it is true] for realistic scenarios.
Scaling an approximate version of epsilon reconstruction therefore seems of prime importance. Real world architectures and data have highly specific structure & symmetry that makes them different from completely generic HMMs. This must most likely be exploited.
The calculi of emergence paper has inspired many people but has not been developed much. Many of the details are somewhat obscure and vague. I also believe that most likely completely different methods are needed to push the program further. Computational Mechanics is primarily a theory of hidden markov models; it doesn’t have the tools to easily describe behaviour higher up the Chomsky hierarchy. I suspect more powerful and sophisticated algebraic, logical and categorical thinking will be needed here. I caveat this by saying that Paul Riechers has pointed out that actually one can understand all these gadgets up the Chomsky hierarchy as infinite HMMs which may be analyzed usefully just as finite HMMs.
The still-underdeveloped theory of epsilon transducers I regard as the most promising lens on agent foundations. This is uncharted territory; I suspect the largest impact of computational mechanics will come from this direction.
Your point on True Names is well-taken. More basic examples than gauge information and synchronization order are the triple of quantities: entropy rate h, excess entropy E, and Crutchfield’s statistical/forecasting complexity C. These are the most important quantities to understand for any stochastic process (such as the structure of language and LLMs!).
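For a feel of two of these quantities, here is a small Python sketch on an example I am choosing myself: the two-state machine for the “Golden Mean” process (binary sequences with no two 1s in a row). The dictionary layout and function names are hypothetical. The entropy-rate formula used here relies on the machine being unifilar, which epsilon machines are; excess entropy E is not computed in this sketch.

```python
from math import log2

# Example machine chosen for illustration (Golden Mean process).
# transitions[state][symbol] = (probability of emitting symbol, next state)
transitions = {
    "A": {"0": (0.5, "A"), "1": (0.5, "B")},
    "B": {"0": (1.0, "A")},
}


def stationary_distribution(transitions, iterations=200):
    """Stationary distribution over states via plain power iteration."""
    states = list(transitions)
    pi = {s: 1.0 / len(states) for s in states}
    for _ in range(iterations):
        nxt = {s: 0.0 for s in states}
        for s, out in transitions.items():
            for prob, target in out.values():
                nxt[target] += pi[s] * prob
        pi = nxt
    return pi


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)


def entropy_rate(transitions, pi):
    """h: state-averaged entropy of the next symbol (valid for unifilar machines)."""
    return sum(
        pi[s] * entropy([prob for prob, _ in out.values()])
        for s, out in transitions.items()
    )


def statistical_complexity(pi):
    """C: entropy of the stationary distribution over causal states."""
    return entropy(pi.values())


pi = stationary_distribution(transitions)
h = entropy_rate(transitions, pi)
C = statistical_complexity(pi)
print(pi, h, C)
```

Roughly, h measures how unpredictable the next symbol remains even with perfect knowledge of the state, while C measures how much memory about the past the machine has to carry to predict optimally.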
Epistemic status: literal shower thoughts, perhaps obvious in retrospect, but was a small insight to me.
I’ve been thinking about: “what proof strategies could prove structural selection theorems, and not just behavioral selection theorems?”
Typical examples of selection theorems in my mind are: coherence theorems, good regulator theorem, causal good regulator theorem.
Coherence theorem: Given an agent satisfying some axioms, we can observe their behavior in various conditions and construct U, and then the agent’s behavior is equivalent to a system that is maximizing U.
Says nothing about whether the agent internally constructs U and uses it.
(Little Less Silly version of the) Good regulator theorem: A regulator R that minimizes the entropy of a system variable S (where there is an environment variable X upstream of both R and S) without unnecessary noise (hence deterministic) is behaviorally equivalent to a deterministic function of S (despite being a function of X).
Says nothing about whether R actually internally reconstructs S and uses it to produce its output.
Causal good regulator theorem (summary): Given an agent achieving low regret across various environment perturbations, we can observe their behavior in specific perturbed-environments, and construct G′ that is very similar to the true environment G. Then argue: “hence the agent must have something internally isomorphic to G”. Which is true, but …
says nothing about whether the agent actually uses those internal isomorphic-to-G structures in the causal history of computing its output.
And I got stuck here wondering, man, how do I ever prove anything structural.
Then I considered some theorems that, if you squint really really hard, could also be framed in the selection theorem language in a very broad sense:
SLT: Systems selected to get low loss are likely to be in a degenerate part of the loss landscape.[1]
Says something about structure: by assuming the system to be a parameterized statistical model, it says the parameters satisfy certain conditions like degeneracy (which further implies e.g., modularity).
This made me realize that to prove selection theorems on structural properties of agents, you should obviously give more mathematical structure to the “agent” in the first place:
SLT represents a system as a parameterized function—very rich!
In coherence theorem, the agent is just a single node that outputs decision given lotteries. In the good regulator theorem and the causal good regulator theorem, the agent is literally just a single node in a Bayes Net—very impoverished!
And recall, we actually have an agent foundations style selection theorem that does prove something structural about agent internals by giving more mathematical structure to the agent:
Gooder regulator theorem: A regulator is now two nodes instead of one, but the latter-in-time node gets additional information about the choice of “game” it is being played against (thus the former node acts as a sort of information bottleneck). Then, given that the regulator makes S take minimum entropy, the first node must be isomorphic to the likelihood function s↦P(S=s|X).
This does say something about structure, namely that an agent (satisfying certain conditions) with an internal information bottleneck (structural assumption) must have that bottleneck be behaviorally equivalent to a likelihood function, whose output is then connected to the second node. Thus it is valid to claim that (under our structural assumption) the agent internally reconstructs the likelihood values and uses them in its computation of the output.
So in short, we need more initial structure or even assumptions on our “agent,” at least more so than literally a single node in a Bayes Net, to expect to be able to prove something structural.
Here is my 5-minute attempt to put more such “structure” to the [agent/decision node] in the Causal good regulator theorem with the hopes that this would make the theorem more structural, and perhaps end up as a formalization of the Agent-like Structure Problem (for World Models, at least), or very similarly the Approximate Causal Mirror hypothesis:
Similar setup to the Causal good regulator theorem, but instead of a single node representing an agent’s decision node, assume that the agent as a whole is represented by an unknown causal graph G, with a number of nodes designated as input and output, connected to the rest-of-the-world causal graph E. Then claim: Agents with low regret must have G that admits an abstracting causal model map (summary) from E, and (maybe more structural properties such as) the approximation error should roughly be lowest around the input/output & utility nodes, and increase as you move further away from it in the low-level graph. This would be a very structural claim!
I’m being very very [imprecise/almost misleading] here—because I’m just trying to make a high-level point and the details don’t matter too much—one of the caveats (among many) being that this statement makes the theoretically yet unjustified connection between SGD and Bayes.
Yeah, I think structural selection theorems matter a lot, for reasons I discussed here.
This is also one reason why I continue to be excited about Algorithmic Information Theory. Computable functions are behavioral, but programs (= algorithms) are structural! The fact that programs can be expressed in the homogeneous language of finite binary strings gives a clear way to select for structure; just limit the length of your program. We even know exactly how this mathematical parameter translates into real-world systems, because we can know exactly how many bits our ML models take up on the hard drives.
And I think you can use algorithmic information distance to well-define just how close to agent-structured your policy is. First, define the specific program A that you mean to be maximally agent-structured (which I define as a utility-maximizing program). If your policy (as a program) can be described as “Program A, but different in ways X” then we have an upper bound for how close to agent-structured it is. X will be a program that tells you how to transform A into your policy, and that gives us a “distance” of at most the length of X in bits.
For a given length, almost no programs act anything like A. So if your policy is only slightly bigger than A, and it acts like A, then it’s probably of the form “A, but slightly different”, which means it’s agent-structured. (Unfortunately this argument needs like 200 pages of clarification.)
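To make the “A, but slightly different” idea concrete, here is a minimal sketch using normalized compression distance. Kolmogorov complexity is incomputable, so `zlib` is swapped in as a very crude stand-in for description length; this is not the distance defined above, only a computable proxy for it. The byte strings `program_a`, `near_a`, and `unrelated` are hypothetical placeholders, not real policies.

```python
import hashlib
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of zlib-compressed data; a crude proxy for K(data)."""
    return len(zlib.compress(data, 9))


def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    ca, cb, cab = compressed_len(a), compressed_len(b), compressed_len(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)


# Hypothetical stand-in for the source of "program A".
program_a = b"def act(obs):\n    return max(actions, key=lambda a: utility(obs, a))\n" * 20
# "A, but different in ways X", where X is a one-token edit.
near_a = program_a.replace(b"max", b"min", 1)
# Deterministic, essentially incompressible bytes of similar length.
unrelated = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(44))

print("A vs. edited A: ", ncd(program_a, near_a))
print("A vs. unrelated:", ncd(program_a, unrelated))
```

The edited copy should come out much closer to A than the unrelated bytes do. A real version would need to compare programs up to behavior-preserving rewrites, which a general-purpose compressor knows nothing about.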
It’s maybe also worth saying that any other description method is a subset of programs (or is incomputable and therefore not what real-world AI systems are). So if the theoretical issues in AIT bother you, you can probably make a similar argument using a programming language with no while loop, or I dunno, finite MDPs whose probability distributions are Gaussian with finite parameter descriptions.
[Some thoughts that are similar but different to my previous comment;]
I suspect you can often just prove the behavioral selection theorem and structural selection theorem in separate, almost independent steps.
Prove a behavioral theorem
add in a structural assumption
prove that behavioral result plus structural assumption implies structural result.
Behavior essentially serves as an “interface”, and a given behavior can be implemented by any number of different structures. So it would make sense that you need to prove something about structure separately (and that you can prove it for multiple different types of structural assumption).
Further claims: for any given structural class,
there will be a natural simplicity measure
simpler instances will be exponentially rare.
A structural class is something like programs, or Markov chains, or structural causal models. The point of specifying structure is to in some way model how the system might actually be shaped in real life. So it seems to me that any of these will be specified with a finite string over a finite alphabet. This comes with the natural simplicity measure of the length of the specification string, and there are exponentially fewer short strings than long ones.[1]
So let’s say you want to prove that your thing X which has behavior B has specific structure S. Since structure S has a fixed description length, you almost automatically know that it’s exponentially less likely for X to be one of the infinitely many structures with description length longer than S. (Something similar holds for being within delta of S) The remaining issue is whether there are any other secret structures that are shorter than S (or of similar length) that X could be instead.
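The counting claim is easy to check numerically. The sketch below only counts strings over a finite alphabet; it says nothing about which strings are valid specifications in a given structural class, and the function names are made up for illustration.

```python
def num_strings_up_to(n: int, alphabet_size: int = 2) -> int:
    """Number of strings of length 0..n over a finite alphabet."""
    return sum(alphabet_size**k for k in range(n + 1))


def fraction_shorter(n: int, k: int, alphabet_size: int = 2) -> float:
    """Among strings of length <= n, the fraction whose length is <= n - k."""
    return num_strings_up_to(n - k, alphabet_size) / num_strings_up_to(n, alphabet_size)


# Being k symbols shorter than the budget n costs roughly a factor of 2**-k.
for k in (1, 5, 10, 20):
    print(k, fraction_shorter(100, k))
```

So under a uniform choice among specifications of length at most n, landing on one that is k symbols shorter than necessary is about 2^-k as likely.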
Technically, you could have a subset of strings that didn’t grow exponentially. For example, you could, for some reason, decide to specify your Markov chains using only strings of zeros. That would grow linearly rather than exponentially. But this is clearly a less natural specification method.
There is a straightforward compmech take also. If the goal of the agent is simply to predict well (let’s say the reward is directly tied to good prediction) for a sequential task AND it performs optimally then we know it must contain the Mixed State Presentation of the epsilon machine (causal states). Importantly the MSP must be used if optimal prediction is achieved.
There is a variant, I think, that has not been worked out yet but that we talked about briefly with Fernando and Vanessa in Manchester recently, for transducers/MDPs.
Thoughts, @Jeremy Gillen?
Not much to add, I haven’t spent enough time thinking about structural selection theorems.
I’m a fan of making more assumptions. I’ve had a number of conversations with people who seem to make the mistake of not assuming enough. Sometimes leading them to incorrectly consider various things impossible. E.g. “How could an agent store a utility function over all possible worlds?” or “Rice’s theorem/halting problem/incompleteness/NP-hardness/no-free-lunch theorems means it’s impossible to do xyz”. The answer is always nah, it’s possible, we just need to take advantage of some structure in the problem.
Finding the right assumptions is really hard though, it’s easy to oversimplify the problem and end up with something useless.
Yes. I would even say that finding the right assumptions is the most important part of proving nontrivial selection theorems.
I think I get what you mean, though making more assumptions is perhaps not the best way to think about it. Logic is monotonic (classical logic at least), meaning that a valid proof remains valid even when adding more assumptions. The “taking advantage of some structure” seems to be different.
Hey, some thoughts in case helpful. I was exploring a little bit into the ‘agent structure’ sort of questions and the Good/Gooder regulator landscape.
You can take GR a bit further by looking at a temporally indexed MDP-like causal diagram and applying various bookkeeping transformations. Search ‘combine nodes’ in John’s post on Bayes net algebra and ‘uncombine’ in my comment on the same.
Then you can see a ‘good regulator motif’ across many timesteps and timescales and draw some richer conclusions.
Here’s a comment where I hastily sketch a version of this.
Wasn’t planning to expand on any of those things, but if you think it’d be especially helpful let me know.
Non-Shannon-type Inequalities
The first new qualitative thing in Information Theory when you move from two variables to three variables is the presence of negative values: information measures (entropy, conditional entropy, mutual information) are always nonnegative for two variables, but there can be negative triple mutual information I(X;Y;Z).
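The standard example is XOR: take X and Y to be independent fair bits and Z = X XOR Y. A minimal sketch that computes I(X;Y;Z) by inclusion-exclusion over joint entropies (the sign convention here is I(X;Y;Z) = I(X;Y) − I(X;Y∣Z)):

```python
import itertools
import math


def entropy(joint, idx):
    """Shannon entropy in bits of the marginal over coordinates `idx`."""
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)


# X, Y are independent fair bits and Z = X XOR Y.
joint = {(x, y, x ^ y): 0.25 for x, y in itertools.product((0, 1), repeat=2)}


def H(*idx):
    return entropy(joint, idx)


# Inclusion-exclusion form of the triple mutual information.
triple_mi = H(0) + H(1) + H(2) - H(0, 1) - H(0, 2) - H(1, 2) + H(0, 1, 2)
print(triple_mi)  # -1.0
```

Any two of the three variables are independent, yet conditioning on the third makes them fully dependent, which is where the one negative bit comes from.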
This so far is a relatively well-known fact. But what is the first new qualitative thing when moving from three to four variables? Non-Shannon-type Inequalities.
A fundamental result in Information Theory is that I(X;Y∣Z)≥0 always holds.
Given n random variables X1,…,Xn and α,β,γ⊆[n], from now on we write I(α;β∣γ) with the obvious interpretation of the variables standing for the joint variables they correspond to as indices.
Since I(α;β|γ)≥0 always holds, a nonnegative linear combination of a bunch of these is always a valid inequality, which we call a Shannon-type Inequality.
Then the question is whether Shannon-type Inequalities capture all valid information inequalities of n variables. It turns out, yes for n=2, (approximately) yes for n=3, and no for n≥4.
Behold, the glorious Zhang-Yeung inequality, a Non-Shannon-type Inequality for n=4:
I(A;B) ≤ 2I(A;B∣C) + I(A;C∣B) + I(B;C∣A) + I(A;B∣D) + I(C;D)
Explanation of the math, for anyone curious.
Given n random variables and α,β,γ⊆[n], it turns out that I(α;β∣γ)≥0 is equivalent to H(α∪β)+H(α∩β)≤H(α)+H(β) (submodularity), H(α)≤H(β) if α⊆β, and H(∅)=0.
This lets us write the inequality involving conditional mutual information in terms of joint entropy instead.
Let Γ*_n then be a subset of ℝ^(2^n), each element corresponding to the values of the joint entropy assigned to each subset of some random variables X_1,…,X_n. For example, an element of Γ*_2 would be (H(∅),H(X_1),H(X_2),H(X_1,X_2))∈ℝ^(2^2) for some random variables X_1 and X_2, with a different element being a different tuple induced by a different pair of random variables (X′_1,X′_2).
Now let Γ_n represent elements of ℝ^(2^n) satisfying the three aforementioned conditions on joint entropy. For example, an element of Γ_2 would be (h_∅,h_1,h_2,h_12)∈ℝ^(2^2) satisfying e.g., h_1≤h_12 (monotonicity). This is also a convex cone, so its elements really do correspond to “nonnegative linear combinations” of Shannon-type inequalities.
Then, the claim that “nonnegative linear combinations of Shannon-type inequalities span all inequalities on the possible Shannon measures” would correspond to the claim that Γ_n=Γ*_n for all n.
The content of the papers linked above is to show that:
Γ_2=Γ*_2
Γ_3≠Γ*_3 but Γ_3=cl(Γ*_3) (closure[1])
Γ_4≠Γ*_4 and Γ_4≠cl(Γ*_4), and the same holds for all n≥4.
This implies that, while there exists a 2^3-tuple satisfying Shannon-type inequalities that can’t be constructed or realized by any random variables X_1,X_2,X_3, there does exist a sequence of random variables (X_1^(k),X_2^(k),X_3^(k)) for k=1,2,… whose induced 2^3-tuples of joint entropies converge to that tuple in the limit.
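As a sanity check on the Zhang-Yeung inequality stated earlier, here is a small sketch that evaluates both sides on randomly sampled joint distributions of four binary variables. Random sampling can only fail to find a counterexample; it proves nothing, and it will not locate the points of Γ_4 that the inequality cuts off. The helper names are made up for illustration.

```python
import itertools
import math
import random


def entropy(joint, idx):
    """Shannon entropy in bits of the marginal over coordinates `idx`."""
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)


def cmi(joint, a, b, c=()):
    """I(a;b|c) = H(a,c) + H(b,c) - H(a,b,c) - H(c), for index tuples a, b, c."""
    a, b, c = tuple(a), tuple(b), tuple(c)
    return (entropy(joint, a + c) + entropy(joint, b + c)
            - entropy(joint, a + b + c) - entropy(joint, c))


def zhang_yeung_gap(joint):
    """Right-hand side minus left-hand side of the inequality."""
    A, B, C, D = (0,), (1,), (2,), (3,)
    rhs = (2 * cmi(joint, A, B, C) + cmi(joint, A, C, B) + cmi(joint, B, C, A)
           + cmi(joint, A, B, D) + cmi(joint, C, D))
    return rhs - cmi(joint, A, B)


rng = random.Random(0)
outcomes = list(itertools.product((0, 1), repeat=4))
gaps = []
for _ in range(200):
    weights = [rng.random() for _ in outcomes]
    total = sum(weights)
    gaps.append(zhang_yeung_gap({o: w / total for o, w in zip(outcomes, weights)}))

print("smallest gap found:", min(gaps))
```

Every sampled gap should be nonnegative up to floating-point error, since each sample is a genuine entropy vector in Γ*_4.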
@Fernando Rosas
The 3-4 Chasm of Theoretical Progress
epistemic status: unoriginal. trying to spread a useful framing of theoretical progress introduced from an old post.
Tl;dr, often the greatest theoretical challenge comes from the step of crossing the chasm from [developing an impractical solution to a problem] to [developing some sort of a polytime solution to a problem], because the nature of their solutions can be opposites.
Summarizing Diffractor’s post on Program Search and Incomplete Understanding:
Solving a foundational problem to its implementation often takes the following steps (some may be skipped):
developing a philosophical problem
developing a solution given infinite computing power
developing an impractical solution
developing some sort of polytime solution
developing a practical solution
and he says that it is often during the 3 → 4 step in which understanding gets stuck and the most technical and brute-force math (and i would add sometimes philosophical) work is needed, because:
a common motif in 3) is that they’re able to prove interesting things about their solutions, like asymptotic properties, by e.g., having their algorithms iterate through all turing machines, hence somewhat conferring the properties of the really good turing machine solution that exists somewhere in this massive search space to the overall search algorithm (up to a massive constant, usually).
think of Levin’s Universal Search, AIXItl, Logical Induction.
he says such algorithms are secretly a black box algorithm; there are no real gears.
Meanwhile, algorithms in 4) have the opposite nature—they are polynomial often because they characterize exploitable patterns that make a particular class of problems easier than most others, which requires Real Understanding. So algorithms of 3) and 4) often look nothing alike.
I liked this post and the idea of the “3-4 chasm,” because it explicitly captures why I personally felt that, e.g., AIT might be less useful for my work: after reading this post, I realized that for example when I refer to the word “structure,” I’m usually pointing at the kind of insights required to cross the 3-4 gap, while others might be using the same word to refer to things at a different level. This causes me to get confused as to how some tool X that someone brought up is supposed to help with the 3-4 gap I’m interested in.[1]
Vanessa Kosoy refers to this post, saying (in my translation of her words) that a lot of the 3-4 gap in computational learning theory has to do with our lack of understanding of deep learning theory, like how the NP-complete barrier is circumvented in practical problems, what are restrictions we can put on our hypothesis class to make them efficiently learnable in the same way our world seems efficiently learnable, etc.
She mentions that this gap, at least in the context of deep learning theory, isn’t too much of a pressing problem because it already has mainstream attention—which explains why a lot of her work seems to lie in the 1-3 regime.
I asked GPT for examples of past crossings of the 3-4 chasm in other domains, and it suggested [Shannon’s original technically-constructive-but-highly-infeasible proof for the existence of optimal codes] vs. [recent progress on turbo codes that actually approach this limit while being very practical], which seems like a perfect example.
AIT in specific seems to be useful primarily in the 1-2 level?
I agree with this framing. The issue of characterizing in what way Our World is Special is the core theoretical question of learning theory.
The way of framing it as a single bottleneck 3-4 maybe understates how large the space of questions is here. E.g. it encompasses virtually every field of theoretical computer science, and physics & mathematics relevant to computation outside of AIT and numerical math.
I’d vote for removing the stage “developing some sort of polytime solution” and just calling 4 “developing a practical solution”. I think listing that extra step is coming from the perspective of someone who’s more heavily involved in complexity classes. We’re usually interested in polynomial time algorithms because they’re usually practical, but there are lots of contexts where practicality doesn’t require a polynomial time algorithm, or really, where we’re just not working in a context where it’s natural to think in terms of algorithms with run-times.
What contexts is it not natural to think in terms of algorithms with specific run-times?
Just read through Robust agents learn causal world models and man it is really cool! It proves a couple of bona fide selection theorems, talking about the internal structure of agents selected according to certain criteria.
Tl;dr, agents selected to perform robustly in various local interventional distributions must internally represent something isomorphic to a causal model of the variables upstream of utility, for it is capable of answering all causal queries for those variables.
Thm 1: agents achieving optimal policy (util max) across various local interventions must be able to answer causal queries for all variables upstream of the utility node
Thm 2: relaxation of above to nonoptimal policies, relating regret bounds to the accuracy of the reconstructed causal model
the proof is constructive—an algorithm that, when given access to regret-bounded-policy-oracle wrt an environment with some local intervention, queries them appropriately to construct a causal model
one implication is an algorithm for causal inference that converts black box agents to explicit causal models (because, y’know, agents like you and i are literally that aforementioned ‘regret-bounded-policy-oracle‘)
These selection theorems could be considered the converse of the well-known statement that given access to a causal model, one can find an optimal policy. (this and its relaxation to approximate causal models is stated in Thm 3)
Thm 1 / 2 is like a ‘causal good regulator‘ theorem.
gooder regulator theorem is not structural—as in, it gives conditions under which a model of the regulator must be isomorphic to the posterior of the system—a black box statement about the input-output behavior.
theorem is limited. only applies to cases where the decision node is not upstream of the environment nodes (eg classification. a negative example would be an mdp). but authors claim this is mostly for simpler proofs and they think this can be relaxed.
yes !! discovered this last week. seems very important. the quantitative regret bounds for approximations are especially promising
I think you can drop this premise and modify the conclusion to “you can find a causal model for all variables upstream of the utility and not downstream of the decision.”
Quick paper review of Measuring Goal-Directedness from the causal incentives group.
tl;dr, goal directedness of a policy wrt a utility function is measured by its min distance to one of the policies implied by the utility function, as per the intentional stance—that one should model a system as an agent insofar as doing so is useful.
Details
how is “policies implied by the utility function” operationalized? given a value u, we define a set containing policies of maximum entropy (of the decision variable, given its parents in the causal bayes net) among those policies that attain the utility u.
then union them over all the achievable values of u to get this “wide set of maxent policies,” and define goal directedness of a policy π wrt a utility function U as the maximum (negative) cross entropy between π and an element of the above set. (actually we get the same result if we quantify the min operation over just the set of maxent policies achieving the same utility as π.)
Intuition
intuitively, this is measuring: “how close is my policy π to being ‘deterministic,’ while ‘optimizing U at the competence level u(π)’ and not doing anything else ‘deliberately’?”
“close” / “deterministic” ~ large negative CE means small CE(π,π_maxent)=H(π)+KL(π||π_maxent)
“not doing anything else ‘deliberately’” ~ because we’re quantifying over maxent policies. the policy is maximally uninformative/uncertain, the policy doesn’t take any ‘deliberate’ i.e. low entropy action, etc.
“at the competence level u(π)” ~ … under the constraint that it is identically competent to π
and you get the nice property of the measure being invariant to translation / scaling of U.
obviously so, because a policy is maxent among all policies achieving u on U iff that same policy is maxent among all policies achieving au+b on aU+b, so these two utilities have the same “wide set of maxent policies.”
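The decomposition of cross entropy into entropy plus KL divergence used in the intuition above can be checked on a toy example. The two distributions below are hypothetical three-action policies chosen only for illustration; in particular `pi_maxent` is not computed from any utility function.

```python
import math


def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross entropy CE(p, q) in nats; assumes q > 0 wherever p > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl(p, q):
    """KL divergence KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


policy = [0.7, 0.2, 0.1]      # hypothetical policy over three actions
pi_maxent = [0.5, 0.3, 0.2]   # hypothetical maxent reference policy

lhs = cross_entropy(policy, pi_maxent)
rhs = entropy(policy) + kl(policy, pi_maxent)
print(lhs, rhs)
```

Both terms on the right are nonnegative, so a small cross entropy forces the policy to be both low-entropy (“deterministic”) and close in KL to some maxent policy.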
Critiques
I find this measure problematic in many places, and am confused whether this is conceptually correct.
one property claimed is that the measure is maximum for uniquely optimal / anti-optimal policy.
it’s interesting that this measure of goal-directedness isn’t exactly an ~increasing function of u(π), and i think it makes sense. i want my measure of goal-directedness to, when evaluated relative to human values, return a large number for both aligned ASI and signflip ASI.
… except, going through the proof one finds that the latter property heavily relies on the “uniqueness” of the policy.
My policy can get the maximum goal-directedness measure if it is the only policy of its competence level while being very deterministic. It isn’t clear that this always holds for the optimal/anti-optimal policies or always relaxes smoothly to epsilon-optimal/anti-optimal policies.
Relatedly, the fact that the quantification is only happening over policies of the same competence level feels problematic.
minimum for uniformly random policy
(this would’ve been a good property, but unless I’m mistaken I think the proof for the lower bound is incorrect, because negative cross entropy is not bounded below.)
honestly the maxent motivation isn’t super clear to me.
not causal. the reason you need causal interventions is because you want to rule out accidental agency/goal-directedness, like a rock that happens to be the perfect size to seal a water bottle. does your rock adapt when I intervene to change the size of the hole? discovering agents is excellent in this regard.
Thanks for the feedback!
Yeah, uniqueness definitely doesn’t always hold for the optimal/anti-optimal policy. I think the way MEG works here makes sense: if you’re following the unique optimal policy for some utility function, that’s a lot of evidence for goal-directedness. If you’re following one of many optimal policies, that’s a bit less evidence—there’s a greater chance that it’s an accident. In the most extreme case (for the constant utility function) every policy is optimal—and we definitely don’t want to ascribe maximum goal-directedness to optimal policies there.
With regard to relaxing smoothly to epsilon-optimal/anti-optimal policies, from memory I think we do have the property that MEG is increasing in the utility of the policy for policies with greater than the utility of the uniform policy, and decreasing for policies with less than the utility of the uniform policy. I think you can prove this via the property that the set of maxent policies is (very nearly) just Boltzmann policies with varying temperature. But I would have to sit down and think about it properly. I should probably add that to the paper if it’s the case.
Thanks for this. The proof is indeed nonsense, but I think the proposition is still true. I’ve corrected it to this.
Reminds me a little bit of this idea from Vanessa Kosoy.
This link doesn’t work for me:
Thanks, it seems like the link got updated. Fixed!
Thanks for writing this up! Having not read the paper, I am wondering if in your opinion there’s a potential connection between this type of work and comp mech type of analysis/point of view? Even if it doesn’t fit in a concrete way right now, maybe there’s room to extend/modify things to combine things in a fruitful way? Any thoughts?
Here’s my current take, I wrote it as a separate shortform because it got too long. Thanks for prompting me to think about this :)
EDIT: I no longer think this setup is viable, for reasons that connect to why I think Critch’s operationalization is incomplete and why boundaries should ultimately be grounded in Pearlian Causality and interventions. Check update.
I believe there’s nothing much in the way of actually implementing an approximation of Critch’s boundaries[1] using deep learning.
Recall, Critch’s boundaries are:
Given a world (markovian stochastic process) Wt, map its values W (vector) bijectively using f into ‘features’ that can be split into four vectors each representing a boundary-possessing system’s Viscera, Active Boundary, Passive Boundary, and Environment.
Then, we characterize boundary-ness (i.e. minimal information flow across features unmediated by a boundary) using two mutual information criteria, one each for infiltration and exfiltration of information.
And a policy of the boundary-possessing system (under the ‘stance’ of viewing the world implied by f) can be viewed as a stochastic map (that has no infiltration/exfiltration by definition) that best approximates the true Wt dynamics.
The interpretation here (under low exfiltration and infiltration) is that f can be viewed as a policy taken by the system in order to perpetuate its boundary-ness into the future and continue being well-described as a boundary-possessing system.
All of this seems easily implementable using very basic techniques from deep learning!
The bijective feature map is implemented using two NN maps, one in each direction, with an autoencoder loss.
Mutual information is approximated with standard variational approximations. Optimize f to minimize it.
(the interpretation here being—we’re optimizing our ‘stance’ towards the world in a way that best views the world as a boundary-possessing system)
After you train your ‘stance’ using the above setup, learn the policy using an NN with standard SGD, with fixed f.
A very basic experiment would look something like:
Test the above setup on two cellular automata (e.g., GoL, Lenia, etc) systems, one containing just random ash, and the other some boundary-like structure like noise-resistant glider structures found via optimization (there are a lot of such examples in the Lenia literature).[2]
Then (1) check if the infiltration/exfiltration values are lower for the latter system, and (2) do some interp to see if the V/A/P/E features or the learned policy NN have any interesting structures.
I’m not sure if I’d be working on this any time soon, but posting the idea here just in case people have feedback.
I think research on boundaries (both conceptual work and developing practical algorithms for approximating them & schemes involving them) is quite important for alignment for reasons discussed earlier in my shortform.
Ultimately we want our setup to detect boundaries that aren’t just physically contiguous chunks of matter, like informational boundaries, so we want to make sure our algorithm isn’t just always exploiting basic locality heuristics.
I can’t think of a good toy testbed (ideas appreciated!), but one easy thing to try is to just destroy all locality by mapping the automata lattice (which we were feeding as input) with the output of a complicated fixed bijective map over it, so that our system will have to learn locality if it turns out to be a useful notion in its attempt at viewing the system as a boundary.
I don’t see much hope in capturing a technical definition that doesn’t fall out of some sort of game theory, and even the latter won’t directly work for boundaries as representation of respect for autonomy helpful for alignment (as it needs to apply to radically weaker parties).
Boundaries seem more like a landmark feature of human-like preferences that serves as a test case for whether toy models of preference are reasonable. If a moral theory insists on tiling the universe with something, it fails the test. Imperative to merge all agents fails the test unless the agents end up essentially reconstructed. And with computronium, we’d need to look at the shape of things it’s computing rather than at the computing substrate.
I think it’s plausible that the general concept of boundaries can possibly be characterized somewhat independently of preferences, but at the same time have boundary-preservation be a quality that agents mostly satisfy (discussion here. very unsure about this). I see Critch’s definition as a first iteration of an operationalization for boundaries in the general, somewhat-preference-independent sense.
But I do agree that ultimately all of this should tie back to game theory. I find Discovering Agents most promising in this regard, though there are still a lot of problems, some of which I suspect might be easier to solve if we treat systems-with-high-boundaryness as a sort of primitive for the kind-of-thing that we can associate agency and preferences with in the first place.
There are two different points here, boundaries as a formulation of agency, and boundaries as a major component of human values (which might be somewhat sufficient by itself for some alignment purposes). In the first role, boundaries are an acausal norm that many agents end up adopting, so that it’s natural to consider a notion of agency that implies boundaries (after the agent had an opportunity for sufficient reflection). But this use of boundaries is probably open to arbitrary ruthlessness, it’s not respect for autonomy of someone the powers that be wouldn’t sufficiently care about. Instead, boundaries would be a convenient primitive for describing interactions with other live players, a Schelling concept shared by agents in this sense.
The second role as an aspect of values expresses that the agent does care about autonomy of others outside game theoretic considerations, so it only ties back to game theory by similarity, or through the story of formation of such values that involved game theory. A general definition might be useful here, if pointing AIs at it could instill it into their values. But technical definitions don’t seem to work when you consider what happens if you try to protect humanity’s autonomy using a boundary according to such definitions. It’s like machine translation, the problem could well be well-defined, but impossible to formally specify, other than by gesturing at a learning process.
I no longer think the setup above is viable, for reasons that connect to why I think Critch’s operationalization is incomplete and why boundaries should ultimately be grounded in Pearlian Causality and interventions.
(Note: I am thinking as I’m writing, so this might be a bit rambly.)
The world-trajectory distribution is ambiguous.
Intuition: Why does a robust glider in Lenia intuitively feel like a system possessing boundary? Well, I imagine various situations that happen in the world (like bullets) and this pattern mostly stays stable in face of them.
Now, notice that the measure of infiltration/exfiltration depends on ϕ∈Δ(W^ω), a distribution over world history:
Infil(ϕ) := Agg_{t≥0} Mut_{W^ω∼ϕ}((V_{t+1},A_{t+1}); E_t ∣ (V_t,A_t,P_t))
So, for the above measure to capture my intuition, the approximate Markov condition (operationalized by low infil & exfil) must consider the world state W^ω that contains the Lenia pattern with it avoiding bullets.
Remember, W is the raw world state, no coarse graining. So ϕ is the distribution over the raw world trajectory. It already captures all the “potentially occurring trajectories under which the system may take boundary-preserving-action.” Since everything is observed, our distribution already encodes all of “Nature’s Intervention.” So in some sense Critch’s definition is already causal (in a very trivial sense), by the virtue of requiring a distribution over the raw world trajectory, despite mentioning no Pearlian Causality.
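To see how the infiltration term behaves, here is a toy sketch of a single time step of it, estimated from samples with a plug-in conditional mutual information estimator. Everything about the toy world is an assumption made for illustration: single bits stand in for the E_t, (V_t,A_t,P_t) and (V_{t+1},A_{t+1}) blocks, the aggregation over t is omitted, and `sample_world` plays the role of drawing from ϕ.

```python
import math
import random
from collections import Counter


def plugin_cmi(samples):
    """Plug-in estimate (bits) of I(X;Y|Z) from a list of (x, y, z) samples."""
    n = len(samples)
    xyz = Counter(samples)
    xz = Counter((x, z) for x, _, z in samples)
    yz = Counter((y, z) for _, y, z in samples)
    zc = Counter(z for _, _, z in samples)
    return sum(
        (c / n) * math.log2(c * zc[z] / (xz[(x, z)] * yz[(y, z)]))
        for (x, y, z), c in xyz.items()
    )


def sample_world(leaky, n=20000, seed=0):
    """Hypothetical one-step world: returns (next_inner, env, inner) triples."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        env = rng.randint(0, 1)         # stands in for E_t
        inner = rng.randint(0, 1)       # stands in for (V_t, A_t, P_t)
        nxt = env if leaky else inner   # stands in for (V_{t+1}, A_{t+1})
        out.append((nxt, env, inner))
    return out


print("leaky boundary: ", plugin_cmi(sample_world(leaky=True)))
print("sealed boundary:", plugin_cmi(sample_world(leaky=False)))
```

When the environment bit reaches the interior without passing through the conditioning variables, the estimate is close to one bit; when the interior only depends on its own previous state, it is zero. The estimate is entirely a function of which trajectories `sample_world` produces, which is the dependence on ϕ discussed here.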
Issue: Choice of ϕ
Maybe there is some canonical true ϕ for our physical world that minds can intersubjectively arrive at, so there’s no ambiguity.
But when I imagine trying to implement this scheme on Lenia, there’s immediately an ambiguity as to which distribution (representing my epistemic state on which raw world trajectories that will “actually happen”) we should choose:
Perhaps a very simple distribution: assigning uniform probability over world trajectories where the world contains nothing but the glider moving in a random direction with some initial point offset.
I suspect many stances other than the one factorizing the world into gliders would have low infil/exfil, because the world is so simple. This is the case of “accidental boundary-ness.”
Perhaps something more complicated: various trajectories where e.g., the Lenia pattern encounters bullets, evolves with various other patterns, etc.
This I think rules out “accidental boundary-ness.”
I think the latter works. But now there’s a subjective choice of the distribution, and of what the set of possible/realistic “Nature’s Interventions” is (all the situations that can ever be encountered by the system under which it has boundary-like behaviors) that we want to implicitly encode into our observational distribution. I don’t think it’s natural for ϕ to assign much probability to a trajectory whose initial conditions are set in a very precise way such that everything decays into noise. But this feels quite subjective.
Hints toward a solution: Causality
I think the discussion above hints at a very crucial insight:
ϕ must arise as a consequence of the stable mechanisms in the world.
Suppose the world of Lenia contains various stable mechanisms like a gun that shoots bullets at random directions, scarce food sources, etc.
We want ϕ to describe distributions that the boundary system will “actually” experience in some sense. I want the “Lenia pattern dodges bullet” world trajectory to be considered, because there is a plausible mechanism in the world that can cause such trajectories to exist. For similar reasons, I think the empty world distributions are impoverished, and a distribution containing trajectories where the entire world decays into noise is bad because no mechanism can implement it.
Thus, unless you have a canonical choice of ϕ, a better starting point would be to consider the abstract causal model that encodes the stable mechanisms in the world, and to use Discovering Agents-style interventional algorithms that operationalize the notion “boundaries causally separate environment and viscera.”
Why? Because of everything mentioned above on how the causal model informs us about which trajectories are realistic, especially in the absence of a canonical ϕ. It’s also far more efficient, because knowledge of the mechanisms tells the algorithm precisely which interventions to query the world for, instead of having to implicitly bake them into ϕ.
There are still a lot more questions, but I think this is a pretty clarifying answer as to how Critch’s boundaries are limiting and why DA-style causal methods will be important.
I think the update makes sense in general, but isn’t there some way in which mutual information and causality are linked? Maybe the link isn’t strong enough for there to be an easy extrapolation from one to the other.
Also I just wanted to drop this to see if you find it interesting, kind of on this topic? I’m not sure it’s fully defined in a causality-based way but it is about structure preservation.
https://youtu.be/1tT0pFAE36c?si=yv6mbswVpMiywQx9
Active Inference people also have the boundary problem as core in their work so they have some interesting stuff on it.
Yeah I’d like to know if there’s a unified way of thinking about information theoretic quantities and causal quantities, though a quick literature search doesn’t turn up anything interesting. My guess is that we’d want separate boundary metrics for informational separation and causal separation.
Becoming Stronger™ (Sep 13 - Sep 27)
Notes and reflections on the things I’ve learned while Doing Scholarship this week (i.e. studying math)[1].
I am starting to see the value of categorical thinking.
For example from [FOAG], it was quite mindblowing to learn that a stalk (the set of germs at a point) can be equivalently defined as a simple colimit of sections of a presheaf over open sets of X containing the point, and this definition made proving certain constructions (eg inducing a map of stalks from a map ϕ:X→Y) very easy.
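For reference, the two descriptions of the stalk can be written side by side. The notation here is my own choice (𝓕 for the presheaf, p for the point) and not copied from [FOAG]; all of U, V, W are open sets of X.

```latex
\mathcal{F}_p
  \;=\; \varinjlim_{U \ni p} \mathcal{F}(U)
  \;=\; \bigl\{ (U, s) : p \in U,\ s \in \mathcal{F}(U) \bigr\} \big/ \sim,
\qquad
(U, s) \sim (V, t) \iff \exists\, W \subseteq U \cap V \text{ with } p \in W \text{ and } s|_W = t|_W .
```

The right-hand side is the usual "germs" definition, and the colimit on the left packages the same equivalence relation as a universal property.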
Also, I was first introduced to the concept of a presheaf as an abstraction of a map that takes open sets and returns functions over them, abstracting properties like there existing a restriction map that composes naturally. Turns out (punchline, presumably) this is just a functor Open(X)^op → Set!
Yoneda lemma is very cool. I recall seeing some of the ideas from Programs as Singularities (paper), where there are ideas of embedding programs (for the analogue in Yoneda lemma, the Yoneda embedding being a fully faithful functor from some category C … ) into a different space (… to the presheaf category Set^(C^op) …) that contains “almost programs” ( … because the Yoneda embedding is not essentially surjective, let alone surjective), and that studying this enlarged space lends insight into the original space.
Rabbit hole: Yoneda lemma as expressing consistency conditions on Lawvere’s idea of Space and Quantity being presheaf and copresheaf??
I am also starting to appreciate more the notion of sheaf or ringed spaces from [FOAG] - or more generally, the notion that a “space” can be productively studied by studying functions defined on it. For example I learned from [Bredon] that a manifold, whose usual definition is a topological space locally homeomorphic to a Euclidean space, can equivalently be defined as a ringed space whose structure sheaf is valued in some subalgebra of continuous maps over a given open set. Very cool!
Started reading [Procesi] to learn invariant theory and representation theory because it came up quite often as my bottleneck in my recent work (eg). Also interpretability, apparently. So far I just read pg 1-9, reviewing the very basics of group action (e.g., orbit stabilizer theorem). Lie groups aren’t coming up until pg ~50 so until then I should catch up on the relevant Lie group prerequisites through [Lee] or [Bredon].
Also reviewed some basic topology by skimming pg 1-50 of [Bredon]. So many rabbit holes just in point-set topology that I can’t afford (time-wise) to follow, e.g.,
(1) nets generalize sequences and successfully characterize topological properties—I once learned of filters, and I do not yet know how they relate (and don’t relate constructively + why constructively filters are more natural) and especially universal net vs ultrafilter
(2) I didn’t know that manifolds are metrizable, but yes they are by an easy consequence of the Urysohn metrization theorem (second-countable & completely regular ⇒ metrizable). But I would like to study this proof in more detail. Also, how to intuitively think about the metric of a manifold?
I didn’t know that the proof to Urysohn metrization was this nice! It’s a consequence of the following lemma: recall, “completely regular” means that given a point x and a closed set C with x∉C, there exists a continuous f:X→[0,1] s.t. f(x)=0 and f(C)={1}. BUT adding second-countability to the hypothesis then lets you choose this f from a fixed, countable family F.
Then, mapping X under this countable family of functions (thus taking values in [0,1]^ℕ) turns out to be an embedding, and [0,1]^ℕ can be metrized, so X can be metrized as well.
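Spelled out, with the countable family enumerated as f_1, f_2, … (the enumeration and the symbols Φ and d are my own notation, and the weights 2^{-n} are one standard choice among many):

```latex
\begin{aligned}
\Phi &: X \to [0,1]^{\mathbb{N}}, \qquad \Phi(x) = \bigl(f_1(x), f_2(x), \dots\bigr), \\
d(x, y) &= \sum_{n=1}^{\infty} 2^{-n}\, \bigl\lvert f_n(x) - f_n(y) \bigr\rvert .
\end{aligned}
```

The sum is the standard metric inducing the product topology on [0,1]^ℕ, pulled back along Φ. Since Φ is an embedding, d induces the original topology on X.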
(3) I learned about various natural variants / generalizations of compactness (σ-compactness, local compactness, paracompactness). My understanding of their importance is because:
(a) paracompactness implies the existence of partition of unity subordinate to any open cover (a consequence of paracompactness ⇒ normal, and Urysohn’s lemma, also paracompactness by definition allowing you to find the open refinement of the given open cover as required by the definition of partition of unity subordinate to an open cover.)
(b) for locally compact Hausdorff X, we can characterize paracompactness by “disjoint union of open σ-compact subsets,” which is much easier to check than the definition of paracompactness as locally finite open refinement of open covers.
e.g., from this, it is immediate that manifolds are paracompact: (1) locally Euclidean ⇒ locally compact. (2) Second-countable ⇒ Lindelöf. (3) Lindelöf & locally compact ⇒ σ-compact. (1) & (2) & (3) + above ⇒ manifolds are paracompact. Other properties of manifolds then follow immediately from paracompactness, eg manifolds always admit a partition of unity subordinate to any open cover.
But rabbit hole: recall, open sets axiomatize semidecidable properties. What is, then, the logical interpretation of compactness, σ-compactness, local compactness, paracompactness?
This week, I’ll start tracking the exercises I solve and pages I cover and post them in next week’s shortform (EDIT: biweekly), so that I can keep track of my progress + additional accountability.
I am self-studying math. The purpose of this shortform is to publicly write down:
things I’ve learned each week,
my progress through the textbooks I’ve committed to read[2], and
other learning-related reflections,
with the aim of:
allocating some time each week reflecting on what I’ve learned to self-improve,
being kept socially accountable by publicly committing to making progress and actually sticking with the textbooks,
and writing / thinking in public, which I enjoy doing.
I am currently reading the following textbooks:
[Lee]: Lee, Smooth Manifolds
[Bredon]: Bredon, Topology and Geometry
[FOAG]: Vakil, Foundations of Algebraic Geometry
[Procesi]: Procesi, Lie Groups: An Approach through Invariants and Representations
and I plan to do most of the exercises for each of the textbooks unless I find some of them too redundant. For this week’s shortform I haven’t written down my progress this week on each of these books nor the problems I’ve solved because I haven’t started tracking them, so I’ll do them starting next week.
Woit’s “Quantum Theory, Groups and Representations” is fantastic for this IMO. It gives physical motivation for representation theory, connects it to invariants and, of course, works through the physically important lie-groups. The intuitions you build here should generalize. Plus, it’s well written.
Also, if you are ever in the market for differential topology, algebraic topology, and algebraic geometry, then I’d recommend Ronald Brown’s “Topology and Groupoids.” It presents the basic material of topology in a way that generalizes better to the fields above, along with some powerful geometric tools for calculations.
Both authors provide free pdfs of their books.
Thanks for the recommendation! Woit’s book does look fantastic (also as an introduction to quantum mechanics). I also know Sternberg’s Group Theory and Physics to be a good representation theory & physics book.
I did encounter Brown’s book during my search for algebraic topology books but I had to pass it over for Bredon’s because it didn’t develop homology / cohomology to the extent I was interested in. Though the groupoid perspective does seem very interesting and useful, so I might read it after completing my current set of textbooks.
No worries! For more recommendations like those two, I’d suggest having a look at “The Fast Track” on Sheafification. Of the books I’ve read from that list, all were fantastic. Note that site emphasises mathematics relevant for physics, and vice versa, so it might not be everyone’s cup of tea. But given your interests, I think you’ll find it useful.
Perhaps I should one day in the far far future write a sequence on bayes nets. Some low-effort TOC (this is basically mostly koller & friedman):
why bayesnets and markovnets? factorized cognition, how to do efficient bayesian updates in practice, it’s how our brain is probably organized, etc. why would anyone want to study this subject if they’re doing alignment research? explain philosophy behind them.
simple examples of bayes nets. basic factorization theorems (the I-map stuff and separation criterion)
tangent on why bayes nets aren’t causal nets, though Zack M Davis had a good post on this exact topic, comment threads there are high insight
how inference is basically marginalization (basic theorems of: a reduced markov net represents conditioning, thus inference upon conditioning is the same as marginalization on a reduced net)
why is marginalization hard? i.e. NP-hardness of both exact and approximate inference in the worst case
what is a workaround? solve by hand simple cases in which inference can be greatly simplified by just shuffling in the order of sums and products, and realize that the exponential blowup of complexity is dependent on a graphical property of your bayesnet called the treewidth
exact inference algorithms (bounded by treewidth) that can exploit the graph structure and do inference efficiently: sum-product / belief-propagation
approximate inference algorithms (works in even high treewidth! no guarantee of convergence) - loopy belief propagation, variational methods, etc
connections to neuroscience: “the human brain is just doing belief propagation over a bayes net whose variables are the cortical column” or smth, i just know that there is some connection
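As a taste of the "shuffling the order of sums and products" item above, here is a minimal sketch on a chain A → B → C → D, which has treewidth 1. The chain, the table names, and the random conditional probability tables are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3  # number of states per variable

def random_cpt(*shape):
    """Random table normalised over its last axis (the child variable)."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

p_a = random_cpt(k)        # P(A)
p_b_a = random_cpt(k, k)   # P(B | A), indexed [a, b]
p_c_b = random_cpt(k, k)   # P(C | B), indexed [b, c]
p_d_c = random_cpt(k, k)   # P(D | C), indexed [c, d]

# Naive marginalization: build the full joint (k**4 entries), then sum out A, B, C.
joint = np.einsum("a,ab,bc,cd->abcd", p_a, p_b_a, p_c_b, p_d_c)
p_d_naive = joint.sum(axis=(0, 1, 2))

# Variable elimination: push each sum inward, so no intermediate
# table is ever larger than k x k.
p_b = p_a @ p_b_a    # sum over a of P(a) P(b | a)
p_c = p_b @ p_c_b    # sum over b of P(b) P(c | b)
p_d_ve = p_c @ p_d_c # sum over c of P(c) P(d | c)
```

For a chain of n variables the naive joint has k^n entries while elimination costs on the order of n·k² operations. On graphs with larger treewidth the intermediate tables grow, which is where the exponential blowup comes back.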
Formalizing selection theorems for abstractability
Tl;dr, Systems are abstractable to the extent they admit an abstracting causal model map with low approximation error. This should yield a pareto frontier of high-level causal models consisting of different tradeoffs between complexity and approximation error. Then try to prove a selection theorem for abstractability / modularity by relating the form of this curve and a proposed selection criteria.
Recall, an abstracting causal model (ACM)—exact transformations, τ-abstractions, and approximations—is a map between two structural causal models satisfying certain requirements that lets us reasonably say one is an abstraction, or a high-level causal model of another.
Broadly speaking, the condition is a sort of causal consistency requirement. It’s a commuting diagram that requires the “high-level” interventions to be consistent with various “low-level” ways of implementing that intervention. Approximation errors talk about how well the diagram commutes (given that the support of the variables in the high-level causal model is equipped with some metric)
Now consider a curve: x-axis is the node count, and y-axis is the minimum approximation error of ACMs of the original system with that node count (subject to some conditions[1]). It would hopefully be a decreasing one[2].
This curve would represent the abstractability of a system. The lower the curve, the more abstractable the system is.
Aside: we may conjecture that natural systems will have discrete jumps, corresponding to natural modules. The intuition being that, eg if we have a physics model of two groups of humans interacting, in some sense 2 nodes (each node representing the human-group) and 4 nodes (each node representing the individual-human) are the most natural, and 3 nodes aren’t (perhaps the 2 node system with a degenerate node doing ~nothing, so it would have very similar approximation scores with the 2 node case).
Then, try hard to prove a selection theorem of the following form: given low-level causal model satisfying certain criteria (eg low regret over varying objectives, connection costs), the abstractability curve gets pushed further downwards. Or conversely, find conditions that make this true.
I don’t know how to prove this[3], but at least this gets closer to a well-defined mathematical problem.
I’ve been thinking about this for an hour now and finding the right definition here seems a bit non-trivial. Obviously there’s going to be an ACM of zero approximation error for any node count, just have a single node that is the joint of all the low-level nodes. Then the support would be massive, so a constraint on it may be appropriate.
Or instead we could fold it into the x-axis: if there is perhaps a non ad-hoc, natural complexity measure for Bayes Nets that captures [high node counts ⇒ high complexity because each node represents a stable causal mechanism of the system, aka a module] and [high support size ⇒ high complexity because we don’t want modules that are “contrived” in some sense] as special cases, then we could use this as the x-axis instead of just node count.
Immediate answer: Restrict this whole setup into a prediction setting so that we can do model selection. Require on top of causal consistency that both the low-level and high-level causal models have a single node whose predictive distributions are similar. Now we can talk about eg the RLCT of a Bayes Net. I don’t know if this makes sense. Need to think more.
Or rather, find the appropriate setup to make this a decreasing curve.
I suspect closely studying the robust agents learn causal world models paper would be fruitful, since they also prove a selection theorem over causal models. Their strategy is to (1) develop an algorithm that queries an agent with low regret to construct a causal model, (2) prove that this yields an approximately correct causal model of the data generating model, (3) then argue that this implies the agent must internally represent something isomorphic to a causal world model.
A simple sketch of the role data structure plays in loss landscape degeneracy.
The RLCT[1] is a function of both q(x) and p(x|θ). The role of p(x|θ) is clear enough, with very intuitive examples[2] of local degeneracy arising from the structure of the parameter function map. However until recently the intuitive role of q(x) really eluded me.
I think I now have some intuitive picture of how structure in q(x) influences RLCT (at least particular instances of it). Consider the following example.
Toy Example: G-invariant distribution, G-equivariant submodule
Suppose the true distribution is (1) realizable (p(⋅|θ∗)=q(⋅) for some θ∗) and (2) invariant under some group action, q(x)=q(gx) ∀x, ∀g∈G. Now, suppose that the model class is that of exponential models, i.e. p(x|θ)∝exp(⟨θ,T(x)⟩). In particular, suppose that T, the fixed feature map, is G-equivariant, i.e. ∃ρ:G→GL(ℝ^d) such that T(gx)=ρ(g)T(x).
Claim: There is a degeneracy of the form p(x|θ∗)=p(x|ρ(g)∗(θ∗)), and in particular if G is a Lie group, the rank upper bound of RLCT decreases by $\tfrac{1}{4}\dim G$.
This is nothing nontrivial. The first claim is an immediate consequence of the definitions:
p(⋅|θ∗)=q(⋅) and q(x)=q(gx) implies p(x|θ∗)=p(gx|θ∗)∀x
Then, we have the following:
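p(gx∣θ∗) ∝ exp(⟨θ∗,T(gx)⟩) = exp(⟨θ∗,ρ(g)T(x)⟩) = exp(⟨ρ(g)∗(θ∗),T(x)⟩) ∝ p(x∣ρ(g)∗(θ∗)), where ρ(g)∗ is the adjoint of ρ(g), and the two normalizing constants agree provided the base measure on x is G-invariant (an assumption left implicit here).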
… and the latter claim on RLCT is a consequence of p(x|θ∗)=p(x|ρ(g)∗(θ∗)) reducing the rank of the Hessian of L(θ) at θ∗ by dimG, together with the rank upper bound result here.
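The equivariance step can be checked numerically on a small finite example. This is a sketch under assumptions of my own choosing: the sample space is {0, …, n−1}, G is the cyclic group Z_n acting by shifts (finite, so not a Lie group), T is the one-hot feature map, and ρ(g) is the corresponding permutation matrix. It only verifies the identity p(gx|θ) = p(x|ρ(g)∗θ), not the RLCT claim.

```python
import numpy as np

n = 5          # sample space {0, ..., n-1}; G = Z_n acts by x -> (x + g) % n
T = np.eye(n)  # one-hot feature map: row x of T is T(x) = e_x

def rho(g):
    """Permutation matrix with rho(g) @ e_x = e_{(x + g) % n}, so T(gx) = rho(g) T(x)."""
    P = np.zeros((n, n))
    P[(np.arange(n) + g) % n, np.arange(n)] = 1.0
    return P

def p(theta):
    """Exponential model p(x | theta) proportional to exp(<theta, T(x)>)."""
    logits = T @ theta
    w = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return w / w.sum()

rng = np.random.default_rng(0)
theta = rng.normal(size=n)
g = 2

shifted = (np.arange(n) + g) % n
lhs = p(theta)[shifted]    # the function x -> p(gx | theta)
rhs = p(rho(g).T @ theta)  # the function x -> p(x | rho(g)^* theta)
```

Here the counting measure is shift-invariant, so the normalizing constants match automatically. The identity holds for every θ, not just at θ∗.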
High-level idea: Emulability of input symmetry
While this model is very toy, I think the high-level idea that it concretely models is interesting. Abstracting out, the proof of how data structure influences degeneracy routes through two steps:
The true distribution has some structure / symmetry, say, q(x)=q(x+δx)∀x (with δx as a function of x, indicating some infinitesimal change; all of this is meant to be taken heuristically), which gets imparted onto p(⋅|θ∗) by realizability, i.e. p(x|θ∗)=p(x+δx|θ∗)∀x.
Emulability: At θ∗, the model can “emulate” certain classes of perturbations to certain classes of input x by instead perturbing the parameters, i.e. p(x+δx|θ∗)=p(x|θ∗+δθ).[3]
Basically, (1) realizability imparts input-symmetry to p(⋅|θ∗), and (2) emulability essentially “pushes forward” this to a symmetry in the parameters[4]. I think this is very interesting!
Story: Suppose I am tasked with image segmentation, but my visual cortex is perturbed by δθ, causing me to perceive colors with a slightly different hue. Then, if my visual cortex wasn’t perturbed but rather the world’s color shifted to that hue i.e. δx, then I would virtually not notice anything and be making the same predictions p(x+δx|θ∗)=p(x|θ∗+δθ).
Going back to the exponential model, the most unrealistic part of it (even after taking into account that it is a toy instantiation of this high-level idea) is the fact that its symmetry is generic: p(gx|θ)=p(x|ρ(g)∗(θ)) holds for ALL θ, since the G-equivariant T is independent of θ. A more realistic model would look something like p(x|θ)∝exp(⟨θ1,T_θ2(x)⟩) with θ=(θ1,θ2), where T also depends on θ2 and, importantly, whether T satisfies G-equivariance depends on the value of θ2.
Then, if p_θ∗=p_θ′∗=q but θ∗ makes T G-equivariant while θ′∗ doesn’t, then the rank upper bound of the RLCT for the former is lower than that of the latter (thus θ∗ would carry much more weight in the Bayesian posterior).
This is more realistic, and I think sheds some light on why training imparts models with circuits / algorithms / internal symmetries that reflect structure in the data.
(Thanks to Dan Murfet for various related discussions.)
Very brief SLT context: In SLT, the main quantity of interest is RLCT, which broadly speaking is a measure of degeneracy of the most degenerate point among the optimal parameters. We care about this because it directly controls the asymptotics of the Bayesian posterior. Also, we often care about its localized version where we restrict the parameter space W to an infinitesimal neighborhood (germ) of a particular optimal parameter we’re interested in measuring the degeneracy of.
RLCT is a particular invariant of the average log likelihood function L(θ)=∫q(x)logp(x|θ)dx, meaning it is a function of the true distribution q(x) and the parametric model p(x|θ) (the choice of the prior φ(θ) doesn’t matter under reasonable regularity conditions).
Given a two layer feedforward network with ReLU, multiplying the first layer by α>0 and dividing the next by α implements the same function. There are many other examples, including non-generic degeneracies which occur only at particular weight values, unlike the constant multiplication degeneracy which occurs at every θ; more examples in Liam Carroll’s thesis.
This reminds me of the notion of data-program equivalence (programs-as-data, Gödel numbering, UTM). Perhaps some infinitesimal version of it?
Let the input-side symmetry be trivial (i.e. δx=0), and we recover degeneracies originating from the structure of the parameter-function map alone as a special case.
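The ReLU rescaling degeneracy mentioned in the footnotes above is easy to check numerically. A minimal sketch, with layer sizes, names, and the value of alpha chosen arbitrarily for illustration; the bias of the first layer is rescaled along with its weights:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)  # first layer
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)  # second layer

def net(x, W1, b1, W2, b2):
    """Two layer feedforward network with a ReLU nonlinearity."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

# alpha must be positive: relu(alpha * z) = alpha * relu(z) only holds for alpha > 0.
alpha = 3.7
x = rng.normal(size=3)

y = net(x, W1, b1, W2, b2)
y_rescaled = net(x, alpha * W1, alpha * b1, W2 / alpha, b2)
```

Every positive alpha gives a different parameter with the same function, so this is a one-dimensional family of equivalent parameters through every θ.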
Any thoughts on how to customize LessWrong to make it LessAddictive? I just really, really like the editor for various reasons, so I usually write a bunch (drafts, research notes, study notes, etc) using it but it’s quite easy to get distracted.
You could use the ad & content blocker uBlock Origin to zap any addictive elements of the site, like the main page feed or the Quick Takes or Popular Comments. Then if you do want to access these, you can temporarily turn off uBlock Origin.
Incidentally, uBlock Origin can also be installed on mobile Firefox, and you can manually sync its settings across devices.
Maybe make a habit of blocking https://www.lesswrong.com/posts/* while writing?
moments of microscopic fun encountered while studying/researching:
Quantum mechanics calls elements of the dual space & the vector space bra/ket because … bra-c-ket. What can I say? I like it. But where did the letter ‘c’ go, Dirac?
Defining cauchy sequences and limits in real analysis: it’s really cool how you “bootstrap” the definition of Cauchy sequences / limit on real using the definition of Cauchy sequences / limit on rationals. basically:
(1) define Cauchy sequence on rationals
(2) use it to define limit (on rationals) using rational-Cauchy
(3) use it to define reals
(4) use it to define Cauchy sequence on reals
(5) show it’s consistent with Cauchy sequence on rationals in both directions
a. rationals are embedded in reals hence the real-Cauchy definition subsumes rational-Cauchy definition
b. you can always find a positive rational number smaller than a given positive real number, hence a sequence being rational-Cauchy means it is also real-Cauchy
(6) define limit (on reals)
(7) show it’s consistent with limit on rationals
(8) … and that they’re equivalent to real-Cauchy
(9) proceed to ignore the distinction b/w real-Cauchy/limit and their rational counterpart. Slick!
(will probably keep updating this in the replies)
Maybe he dropped the “c” because it changes the “a” phoneme from æ to ɑː and gives a cleaner division in sounds: “brac-ket” pronounced together collides with “bracket” where “braa-ket” does not.
Any advice on reducing neck and shoulder pain while studying? For me that’s my biggest blocker to being able to focus longer (especially for math, where I have to look down at my notes/book for a long period of time). I’m considering stuff like getting a standing desk or doing regular back/shoulder exercises. Would like to hear what everyone else’s setups are.
Train skill of noticing tension and focus on it. Tends to dissolve. No that’s not so satisfying but it works. Standing desk can help but it’s just not that comfortable for most.
weight training?
I still have lots of neck and shoulder tension, but the only thing I’ve found that can reliably lessen it is doing some hard work on a punching bag for about 20 minutes every day, especially hard straights and jabs with full extension.
I’ve used Pain Science in the past as a resource and highly, highly endorse it. Here is an article they have on neck pain.
(Quality: Low, only read when you have nothing better to do—also not much citing)
30-minute high-LLM-temp stream-of-consciousness on “How do we make mechanistic interpretability work for non-transformers, or just any architectures?”
We want a general way to reverse engineer circuits
e.g., Should be able to rediscover properties we discovered from transformers
Concrete Example: we spent a bunch of effort reverse engineering transformer-type architectures, then boom, suddenly some parallel-GPU-friendly-LSTM architecture turns out to have better scaling properties, and everyone starts using it. LSTMs have different inductive biases, like things in the same layer being able to communicate multiple times with each other (unlike transformers), which incentivizes e.g., reusing components (more search-y?).
Formalize:
You have task X. You train a model A with inductive bias I_A. You also train a model B with inductive bias I_B. Your mechanistic interpretability techniques work well on deciphering A, but not B. You want your mechanistic interpretability techniques to work well for B, too.
Proposal: Communication channel
Train a Transformer on task X
Existing Mechanistic interpretability work does well on interpreting this architecture
Somehow stitch the LSTM to the transformer (?)
I’m trying to get at the idea of “interface conversion”: by virtue of SGD being greedy, it will try to convert the LSTM’s outputs into transformer-friendly types
Now you can better understand the intermediate outputs of the LSTM by just running mechanistic interpretability on the transformer layers whose input are from the LSTM
(I don’t know if I’m making any sense here, my LLM temp is > 1)
Proposal: approximation via large models?
Train a larger transformer architecture to approximate the smaller LSTM model (either just input output pairs, or intermediate features, or intermediate features across multiple time-steps, etc):
the basic idea is that a smaller model would be more subject to following its natural gradient shaped by the inductive bias, while larger model (with direct access to the intermediate outputs of the smaller model) would be able to approximate it despite not having as much inductive bias incentive towards it.
probably false but illustrative example: Train small LSTM on chess. By the virtue of being able to run serial computation on same layers, it focuses on algorithms that have repeating modular parts. In contrast, a small Transformer would learn algorithms that don’t have such repeating modular parts. But instead, train a large transformer to “approximate” the small LSTM—it should be able to do so by, e.g., inefficiently having identical modules across multiple layers. Now use mechanistic interpretability on that.
Proposal: redirect GPS?
Thane’s value formation picture says GPS should be incentivized to reverse-engineer the heuristics because it has access to inter-heuristic communication channel. Maybe, in the middle of training, gradually swap different parts of the model with those that have different inductive biases, see GPS gradually learn to reverse-engineer those, and mechanistically-interpret how GPS exactly does that, and reimplement in human code?
Proposal: Interpretability techniques based on behavioral constraints
e.g., Discovering Latent Knowledge without Supervision, putting constraints?
How do we “back out” inductive biases, just given e.g., architecture, training setup? What is the type signature?
(I need to read more literature)
Becoming Stronger™ (Sep 28 - Oct 12)
Notes and reflections on the things I’ve learned while Doing Scholarship the last two weeks (i.e. studying math).
Mostly the past two weeks were on differential geometry (Lee):
Ch 4 (Submersion, Immersion, Embedding) comments:
Conceptually, by the Constant rank theorem, constant rank maps (smooth maps whose differential $dF_p : T_pM \to T_{F(p)}N$ has constant rank at all p) are precisely the maps with a linear local coordinate representation (thus are maps well-modeled locally by their differentials).
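Basically a nonlinear version of the linear algebra theorem that any matrix of rank r can be expressed, after suitable changes of basis on the domain and codomain, as the block matrix $\begin{pmatrix} I_r & 0 \\ 0 & 0 \end{pmatrix}$. The proof is much more complicated however: basically a clever choice of coordinate transformation via the inverse function theorem.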
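The linear algebra statement can be checked directly. A minimal sketch that builds the two changes of basis from the SVD; this is one convenient construction among many, and the matrix sizes and names are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 4, 5, 2
A = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))  # a 4 x 5 matrix of rank 2

U, s, Vt = np.linalg.svd(A)           # A = U @ Sigma @ Vt with U (4 x 4), Vt (5 x 5)
rank = int(np.sum(s > 1e-10 * s[0]))  # numerical rank

scale = np.ones(m)
scale[:rank] = 1.0 / s[:rank]         # rescale the nonzero singular values to 1
P = np.diag(scale) @ U.T              # invertible change of basis on the codomain
Q = Vt.T                              # invertible change of basis on the domain

# The block matrix with I_r in the top-left corner and zeros elsewhere.
normal_form = np.zeros((m, n))
normal_form[:rank, :rank] = np.eye(rank)

reduced = P @ A @ Q
```

In the Constant rank theorem the role of P and Q is played by smooth coordinate changes near F(p) and p, which is why the inverse function theorem is needed there.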
The point of the chapter is to come up with various characterizations of submersion, immersion, embedding. For example, 1) smooth immersion iff locally smooth embedding, 2) smooth submersion iff every point is in the image of a local section, 3) for constant rank maps, surjective ⇒ submersion & injective ⇒ immersion …
The proof of 3) is a very cool application of the Baire category theorem. Baire category theorem says the countable union of nowhere dense sets has empty interior; this is not very motivating, but reading Bredon[1] helped clarify its conceptual significance.
Namely, consider the more illuminating equivalent statement obtained by taking complements: a countable intersection of dense open sets is dense. Conceptually, the space is some configuration space, and dense open sets represent configurations that satisfy certain generically satisfied constraints (a prototypical example: for a nonzero polynomial p, the set where p(x) is nonzero is dense & open). Then, the question is whether the property of a countable number of these constraints being satisfied at the same time is still generic, i.e. dense. The Baire category theorem says this is indeed the case (for locally compact Hausdorff spaces).
Sections are just right inverses, and their intuitive geometric content was a bit confusing until I read the wikipedia page: a section of f is an abstraction of a graph by viewing f as a sort of “projection map.” That makes sense! I’m sure this will come up later in the fiber bundle context.
The “figure-eight curve” and the “dense torus map” are prototypical examples of smooth immersions that aren’t smooth embeddings, due to topological considerations.
Ch 5 (Submanifold) comments:
Similar to Ch 4, many useful characterizations of submanifolds and how to generate them. eg embedded submanifold iff locally a “slice” of the ambient manifold’s coordinate chart. embedded submanifold iff image of smooth embedding, immersed submanifold iff image of smooth immersion. Level sets of a smooth map at a “regular value” are embedded submanifolds …
Ch 6 (Sard’s theorem) comments:
Finally, one of the more fun chapters! Finally learned the proof of the Whitney embedding / immersion theorem that I’ve heard a lot about.
The compact case of the Whitney embedding theorem is much more conceptually straightforward:
Given m charts (finitely many, which is possible since M is compact) covering the n-dim manifold M, literally just adjoin them while multiplying each by the corresponding partition-of-unity function to get a map M→R^{nm}, then adjoin the m partition-of-unity functions themselves (a “chart indicator variable”) to get a map M→R^{nm+m}. This turns out to be an injective immersion, and thus an embedding since M is compact.
Apply the projection map R^N→R^{N−1} with a 1-dim kernel Rv. By Sard’s theorem, this turns out to be an immersion (when restricted to M) for almost every choice of v, as long as N>2n+1. Repeatedly apply this to the massive codomain of M→R^{nm+m} to get an immersion into R^{2n+1}.
This projection can in fact be promoted to an embedding, given that the original immersion of M into R^N is an embedding.
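For concreteness, here is a sketch of the compact-case map written out, together with the reason it is an injective immersion. The notation (charts (U_i, φ_i), partition of unity ρ_i subordinate to the U_i) is my own labeling and not necessarily the book's.

```latex
% Finite atlas (U_i, \varphi_i), i = 1..m, with a smooth partition of unity \rho_i,
% supp \rho_i \subset U_i; each product \rho_i \varphi_i is extended by zero outside U_i.
F \colon M \to \mathbb{R}^{nm+m}, \qquad
F(p) = \bigl(\rho_1(p)\varphi_1(p), \dots, \rho_m(p)\varphi_m(p),\; \rho_1(p), \dots, \rho_m(p)\bigr)

% Injective: F(p) = F(q) gives \rho_i(p) = \rho_i(q) > 0 for some i,
% hence \varphi_i(p) = \varphi_i(q), hence p = q.
% Immersion: dF_p(v) = 0 gives d\rho_i(v) = 0 for all i, so
d(\rho_i\varphi_i)_p(v) = \varphi_i(p)\, d\rho_i(v) + \rho_i(p)\, d\varphi_i(v)
                        = \rho_i(p)\, d\varphi_i(v) = 0,
% and choosing i with \rho_i(p) > 0 forces v = 0 because \varphi_i is a diffeomorphism onto its image.
```

The last m coordinates are exactly what makes both arguments work: they let you read off which chart is “active” at a point.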
High-level takeaways:
The most dumb and obvious way of interpolating coordinate charts into a global map via partitions of unity, with slight modifications, gives a bona fide immersion of a manifold into R^N!!
It was interesting to learn that there was a 1-2 decade period of foundational uncertainty (between the first proposal of the abstract manifold definition and Whitney’s above proof) where people didn’t know whether the abstract manifold definition was actually more general than submanifolds of R^N or not.[2]
Partitions of unity really are used everywhere. I wonder how the theory of complex analytic manifolds ever does anything when analytic partitions of unity don’t exist.
Proof strategy of promoting a smooth map to a proper map (at the cost of increased dimensionality of the codomain) by literally adjoining a proper map next to it. Clever!
I presume this is the main motivation behind exhaustion functions (f: M→R s.t. f^{−1}((−∞,c]) is compact ∀c∈R). Such an f is a proper map, it exists for any manifold (again, shown by partitions of unity), and it has a codomain of dimension 1, so it minimally increases the dimension of the codomain.
More applications on Whitney approximation theorems and transversality arguments.
The latter, including the transversality homotopy theorem (actually learned this a year ago in my difftop class, though that class used Guillemin’s book where manifolds are always embedded in RN - so it’s good to learn them from a more intrinsic perspective) is very interesting.
It also ties to one of my motivations for all this math learning, backchaining from trying to do good alignment theory work, which is learning the math of structural stability and its role in the theory of forms (morphogenesis), cf. Thom, Structural Stability and Morphogenesis (thank you Dan Murfet for explaining this perspective).
Rabbit holes that I could not afford to pursue:
The category of smooth manifolds is an idempotent-splitting completion of the category of open subspaces of findim cartesian spaces?!?!?! My mind is blown.
So much more elegant than the standard definition via charts and maximal smooth structures and such. Unsure of the utility of this characterization though, lol (read Lawvere’s paper).
There is a duality between the category of smooth manifolds and the category of R-algebras. Fascinating how such dualities between algebra and geometry seem to be a very common motif throughout different fields, I’m sure this will come up in Vakil’s book later. Also curious about Gelfand’s duality on this for topological spaces.
“It is better to have a good category with bad objects than a bad category with good objects.”—Grothendieck (probably not). For example, the category of smooth manifolds is not nice, motivating smooth sets, diffeological spaces, and so on.
Dichotomy between nice objects and nice categories: in the context of alignment theory, maybe I can view Programs as Singularities as an instantiation of this idea, obtained by enlarging the class of Turing machines.
I found this intuition for adjoint functors illuminating. Specifically, note that set maps f:X→Y and g:Y→X being inverses is equivalent to the condition that their graphs are mirrored along the diagonal, i.e. x=g(y) exactly when f(x)=y. Rephrase this using the Kronecker delta: δ(x,g(y))=δ(f(x),y). Now δ can be seen as expressing a “relation” that could be exhibited by two elements of a set, i.e. equality (1) or inequality (0). But in general categories, objects can exhibit more relations, so replace δ by Hom, i.e. Hom(x,G(y))≅Hom(F(x),y), and you get adjoint functors!
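A small sanity check of this intuition, as a sketch: first the δ identity for a toy pair of inverse maps, then the same identity with “=” replaced by “≤” on the integers, where Hom is either empty or a single arrow (a Galois connection). The particular maps F(x)=2x and G(y)=⌊y/2⌋ are my own choice of example.

```python
from itertools import product

def delta(a, b):
    """Kronecker delta: 1 if equal, else 0."""
    return 1 if a == b else 0

# Set-level case: f and g are mutually inverse bijections (toy choice).
X = [0, 1, 2]
Y = ["a", "b", "c"]
f = {0: "a", 1: "b", 2: "c"}
g = {v: k for k, v in f.items()}

inverse_condition = all(
    delta(x, g[y]) == delta(f[x], y) for x, y in product(X, Y)
)

# Poset case: replace "=" by "<=".
# F(x) = 2x and G(y) = floor(y / 2) satisfy: 2x <= y  iff  x <= y // 2.
def F(x):
    return 2 * x

def G(y):
    return y // 2  # floor division, also correct for negative y

rng = range(-10, 11)
adjunction_condition = all(
    (F(x) <= y) == (x <= G(y)) for x, y in product(rng, rng)
)

print(inverse_condition, adjunction_condition)
```

Note that F and G here are not inverses (F(G(3)) = 2), which is the point: the Hom-version of the condition is strictly weaker than invertibility.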
Example of how reading books in parallel improves learning efficiency.
Why that long? The dimensionality reduction by projection is perhaps more nontrivial because of Sard, but the obvious gluing should have been sufficient to construct an immersion at least, albeit at the cost of an inefficient codomain dimension. Maybe the historically difficult part was the concept of a partition of unity and the fact that one always exists on a manifold?
Discovering Agents provides a genuinely causal, interventionist account of agency and an algorithm to detect agents, motivated by the intentional stance. I find this paper very enlightening from a conceptual perspective!
I’ve tried to think of problems, both conceptual and practical, that need to be solved before we can actually implement this on real systems, in approximate order of importance.
There are no ‘dynamics,’ no learning. As soon as a mechanism node is edited, it is assumed that agents immediately change their ‘object decision variable’ (a conditional probability distribution given its object parent nodes) to play the subgame equilibria.
Assumption of factorization of variables into ‘object’ / ‘mechanisms,’ and the resulting subjectivity. The paper models the process by which an agent adapts its policy given changes in the mechanism of the environment via a ‘mechanism decision variable’ (that depends on its mechanism parent nodes), which modulates the conditional probability distribution of its child ‘object decision variable’, the actual policy.
For example, the paper says a learned RL policy isn’t an agent, because interventions in the environment won’t make it change its already-learned policy, but that a human, or an RL policy together with its training process, is an agent, because it can adapt. Is this reasonable?
Say I have a gridworld RL policy that’s learned to get cheese (3 cell world, cheese always on left) by always going to the left. Clearly it can’t change its policy when I change the cheese distribution to favor right, so it seems right to call this not an agent.
Now, say the policy now has sensory access to the grid state, and correctly generalized (despite only being trained on left-cheese) to move in the direction where it sees the cheese, so when I change the cheese distribution, it adapts accordingly. I think it is right to call this an agent?
Now, say the policy is an LLM agent (static weights) in an open-world simulation, and it reasons in-context. I just changed the mechanism of the simulation by lowering the gravity constant, and the agent observes this, reasons in-context, and adapts its sensorimotor policy accordingly. This is clearly an agent?
I think this is because the paper considers, in the case of the RL policy alone, the ‘object policy’ to be the policy of the trained neural network (whose induced policy distribution is definitionally fixed), and the ‘mechanism policy’ to be a trivial delta function assigning the already-trained object policy. And in the case of the RL policy together with its training process, the ‘mechanism policy’ is now defined as the training process that assigns the fully-trained conditional probability distribution to the object policy.
But what if the ‘mechanism policy’ was the in-context learning process by which it induces an ‘object policy’? Then changes in the environment’s mechanism can be related to the ‘mechanism policy’ and thus the ‘object policy’ via in-context learning as in the second and third example, making them count as agents.
Ultimately, the setup in the paper forces us to factorize the means-by-which-policies-adapt into mechanism vs object variables, and the results (like whether a system is to be considered an agent) depend on this factorization. It’s not always clear what the right factorization is, how to discover it from data, or if this is the right frame to think about the problem at all.
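To make the gridworld examples above concrete, here is a minimal toy sketch (my own construction, not the paper's algorithm). The “mechanism” is p_left, the probability that the cheese is on the left; the two policies and all function names are hypothetical.

```python
# Toy 3-cell world: the agent starts in the middle, cheese is on the left or right.
# The mechanism being intervened on is p_left = P(cheese on the left).

def blind_policy(cheese_side):
    """Ignores its observation and always goes left."""
    return "left"

def sighted_policy(cheese_side):
    """Moves toward the cheese it observes."""
    return cheese_side

def conditional_table(policy):
    """The policy as a fixed map from observation to action."""
    return {obs: policy(obs) for obs in ("left", "right")}

def marginal_action_dist(policy, p_left):
    """Distribution over actions induced by a given cheese mechanism."""
    dist = {"left": 0.0, "right": 0.0}
    dist[policy("left")] += p_left
    dist[policy("right")] += 1.0 - p_left
    return dist

def expected_reward(policy, p_left):
    """Probability of reaching the cheese."""
    hit_left = 1.0 if policy("left") == "left" else 0.0
    hit_right = 1.0 if policy("right") == "right" else 0.0
    return p_left * hit_left + (1.0 - p_left) * hit_right

# Intervention on the mechanism: cheese always left -> cheese always right.
before, after = 1.0, 0.0
for name, policy in [("blind", blind_policy), ("sighted", sighted_policy)]:
    print(name, conditional_table(policy))
    print("  actions:", marginal_action_dist(policy, before),
          "->", marginal_action_dist(policy, after))
    print("  reward: ", expected_reward(policy, before),
          "->", expected_reward(policy, after))
```

In both cases the observation-to-action table is fixed, yet only the sighted policy's behavior tracks the intervention. Whether that counts as “adapting” depends on whether one treats the table or the induced behavior as the object-level policy, which is the factorization question.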
Implicit choice of variables that are convenient for agent discovery. The paper does mention that the algorithm is dependent on the choice of variables, as in: if the node corresponding to the ‘actual agent decision’ is missing but its children are there, then the algorithm will label its children as the decision nodes. But this is already a very convenient representation!
Prototypical example: Minecraft world with RL agents interacting represented as a coarse-grained lattice (dynamical Bayes Net?) with each node corresponding to a physical location and its property, like color. Clearly no single node here is an agent, because agents move! My naive guess is that in principle, everything will be labeled an agent.
So the variables of choice must be abstract variables of the underlying substrate, like functions over them. But then, how do you discover the right representation automatically, in a way that interventions in the abstract variable level can faithfully translate to actually performable interventions in the underlying substrate?
Given the causal graph, even the slightest satisfaction of the agency-criterion labels the nodes as decision / utility. There is no “degree-of-agency”; maybe one could be defined by summing over the extent to which the independencies fail to hold?
Then different agents are defined as causally separated chunks (~connected component) of [set-of-decision-nodes / set-of-utility-nodes]. How do we accommodate hierarchical agency (like subagents), systems with different degrees of agency, etc?
The interventional distributions on the object/mechanism variables are converted into a causal graph using the obvious [perform-do()-while-fixing-everything-else] algorithm. My impression is that causal discovery doesn’t really work in practice, especially in noisy reality with a large number of variables and a gazillion conditional independence tests.
The correctness proof requires lots of unrealistic assumptions, e.g., agents always play subgame equilibria, though I think some of this can be relaxed.
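As an illustration of the [perform-do()-while-fixing-everything-else] idea mentioned above, here is a minimal sketch on a hypothetical, deterministic, noise-free three-variable SCM (A→B→C). It is not the paper's algorithm verbatim; all names are made up for the example.

```python
from itertools import product

# Hypothetical toy SCM over binary variables; each mechanism reads earlier values.
MECHANISMS = {
    "A": lambda v: 0,
    "B": lambda v: v["A"],
    "C": lambda v: v["B"],
}
ORDER = ["A", "B", "C"]  # topological order, used only to simulate the system

def run(do):
    """Evaluate the SCM under the intervention `do` (dict: variable -> forced value)."""
    values = {}
    for var in ORDER:
        values[var] = do[var] if var in do else MECHANISMS[var](values)
    return values

def has_direct_edge(x, y):
    """x -> y iff changing do(x) changes y for some fixed setting of all other variables."""
    others = [v for v in ORDER if v not in (x, y)]
    for setting in product([0, 1], repeat=len(others)):
        fixed = dict(zip(others, setting))
        y0 = run({**fixed, x: 0})[y]
        y1 = run({**fixed, x: 1})[y]
        if y0 != y1:
            return True
    return False

edges = sorted(
    (x, y) for x in ORDER for y in ORDER
    if x != y and has_direct_edge(x, y)
)
print(edges)
```

Fixing everything else is what separates direct from indirect influence: do(A) alone changes C, but no A→C edge is reported because B is held fixed. With noise, each comparison becomes a statistical test, and the number of interventions grows exponentially in the number of variables, which is where the practical worry comes from.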
I am curious as to how often the asymptotic results proven using features of the problem that seem basically practically-irrelevant become relevant in practice.
Like, I understand that there are many asymptotic results (e.g., free energy principle in SLT) that are useful in practice, but I feel like there’s something sus about similar results from information theory or complexity theory, where the way in which they prove certain bounds (or inclusion relationships, for complexity theory) seems totally detached from practicality?
the joint source-channel coding theorem is often stated as the reason we can consider the problems of compression and redundancy separately, but when you actually look at the proof it only talks about possibility (which is proven in terms of insanely long codes), and thus it is not at all trivial that this equivalence holds in the context of practical code-engineering
complexity theory talks about stuff like quantifying some property over all possible boolean circuits of a given size, which seems to me to be considering a feature of the problem so utterly irrelevant to real programs that I’m suspicious it can say meaningful things about stuff we see in practice
as an aside, does the P vs NP distinction even matter in practice? we just … seem to have very good approximations to NP problems by algorithms that take into account the structures specific to the problem and domains where we want things to be fast; and as long as complexity methods don’t take into account those fine structures that are specific to a problem, I don’t see how they would characterize such well-approximated problems using complexity classes.
Wigderson’s book had a short section on average complexity which I hoped would be this kind of a result, and I’m unimpressed (the problem doesn’t sound easier—now how do you specify the natural distribution??)
P v NP: https://en.wikipedia.org/wiki/Generic-case_complexity
One result to mention in computational complexity is the PCP theorem, which not only gives probabilistically checkable proofs but also gives hardness-of-approximation results. Seems deep but I haven’t understood the proof yet.
Great question. I don’t have a satisfying answer. Perhaps a cynical answer is survivorship bias: we remember the asymptotic results that eventually become relevant (because people develop practical algorithms or a deeper theory is discovered) but don’t remember the irrelevant ones.
Existence results are categorically easier to prove than explicit algorithms. Indeed, classical existence (the former) may hold while intuitionistic existence (the latter) might not. We would expect non-explicit existence results to appear before explicit algorithms.
One minor remark on ‘quantifying over all boolean algorithms’. Unease with quantification over large domains may be a vestige of set-theoretic thinking that imagines types as (platonic) boxes. But a term of a for-all quantifier is better thought of as an algorithm/ method to check the property for any given term (in this case a Boolean circuit). This doesn’t sound divorced from practice to my ears.
Yes, it does, for several reasons:
It basically is necessary to prove P != NP to get a lot of other results to work, and for some of those results, proving P != NP is sufficient.
If P != NP (as most people suspect), it fundamentally rules out solving lots of problems generally and quickly without exploiting structure, and in particular lets me flip the burden of proof to the algorithm maker to explain why their solution to a problem like SAT is efficient, rather than me having to disprove the existence of an efficient algorithm.
It’s either by exploiting structure, somehow having a proof that P=NP, or relying on new physics models that enable computing NP-complete problems efficiently, and the latter 2 need very, very strong evidence behind them.
This in particular applies to basically all learning problems in AI today.
It explains why certain problems cannot reasonably be solved optimally without huge discoveries. The best example of a problem we are unable to solve optimally is the travelling salesman problem, along with a whole lot of other NP-complete problems. There are also other NP problems where there isn’t a way to solve them efficiently at all, especially if FPT != W[1] holds.
Also note that we expect a lot of NP-complete problems to not be solvable by fast algorithms even in the average case, which basically means hardness is likely to be relevant quite a lot of the time, so we don’t have to limit ourselves to the worst case either.
I recently learned about metauni, and it looks amazing. TL;DR: a bunch of researchers give lectures or seminars on Roblox. Topics include AI alignment/policy, Natural Abstractions, Topos Theory, Singular Learning Theory, etc.
I haven’t actually participated in any of their live events yet and only watched their videos, but they all look really interesting. I’m somewhat surprised that there hasn’t been much discussion about this on LW!
Complaint with Pugh’s real analysis textbook: He doesn’t even define the limit of a function properly?!
It’s implicitly defined together with the definition of continuity, where ∀ϵ>0 ∃δ>0: |x−x0|<δ ⟹ |f(x)−f(x0)|<ϵ, but in Chapter 3, when defining differentiability, he implicitly switches the condition to 0<|x−x0|<δ without even mentioning it (nor the requirement that x0 now needs to be an accumulation point!). While Pugh’s book has its own benefits, coming from a Terry Tao analysis textbook background, this is absurd!
(though to be fair Terry Tao has the exact same issue in Book 2, where his definition of function continuity via limit in metric space precedes that of defining limit in general … the only redeeming factor is that it’s defined rigorously in Book 1, in the limited context of R)
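For reference, the two conditions being conflated can be written side by side. This is the standard textbook formulation in my own notation (f defined on a domain D ⊆ R), not a quote from either book.

```latex
% Continuity of f : D \to \mathbb{R} at x_0 \in D (no puncture needed):
\forall \epsilon > 0 \;\exists \delta > 0 \;\forall x \in D:\quad
|x - x_0| < \delta \implies |f(x) - f(x_0)| < \epsilon

% Limit of f at an accumulation point x_0 of D (x_0 need not lie in D):
\lim_{x \to x_0} f(x) = L \iff
\forall \epsilon > 0 \;\exists \delta > 0 \;\forall x \in D:\quad
0 < |x - x_0| < \delta \implies |f(x) - L| < \epsilon

% Differentiability applies the second definition to the difference quotient,
% which is undefined at x = x_0:
f'(x_0) = \lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0}
```

The puncture 0 < |x − x_0| is forced in the last line because the difference quotient has no value at x = x_0, and the accumulation-point requirement guarantees that the punctured neighborhood is never empty.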
*sigh* I guess we’re still pretty far from reaching the Pareto Frontier of textbook quality, at least in real analysis.
… Speaking of Pareto Frontiers, would anyone say there is such a textbook that is close to that frontier, at least in a different subject? Would love to read one of those.
Maybe you should email Pugh with the feedback? (I audited his honors analysis course in fall 2017; he seemed nice.)
As far as the frontier of analysis textbooks goes, I really like how Schröder Mathematical Analysis manages to be both rigorous and friendly: the early chapters patiently explain standard proof techniques (like the add-and-subtract triangle inequality gambit) to the novice who hasn’t seen them before, but the punishing details of the subject are in no way simplified. (One wonders if the subtitle “A Concise Introduction” was intended ironically.)
I used to try out near-random search on ideaspace, where I made a quick app that spat out 3~5 random words from a dictionary of interesting words/concepts that I curated, and I spent 5 minutes every day thinking very hard on whether anything interesting came out of those combinations.
Of course I knew random search on exponential space was futile, but I got a couple cool invention ideas (most of which turned out to already exist), like:
infinite indoor rockclimbing: attach rocks to a vertical treadmill, and now you have an infinite indoor rock climbing wall (which is also safe from falling)! maybe add some fancy mechanism to add variations to the rocks + a VR headgear, I guess.
clever crypto mechanism design (in the spirit of CO2 Coin) to incentivize crowdsourcing of age-reduction molecule design animal trials from the public. (I know what you’re thinking)
You can probably do this smarter now if you wanted, with eg better GPT models.
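A minimal sketch of the kind of random-word prompt generator described above. The word list is a placeholder (the original curated dictionary isn't given), and the function name is made up.

```python
import random

# Placeholder word list; the original curated dictionary is not given in the post.
CONCEPTS = [
    "treadmill", "rock climbing", "VR headset", "mechanism design",
    "crowdsourcing", "fractal", "compression", "auction",
]

def idea_prompt(concepts, rng, min_words=3, max_words=5):
    """Return between min_words and max_words distinct concepts, chosen at random."""
    k = rng.randint(min_words, max_words)  # inclusive on both ends
    return rng.sample(concepts, k)         # sampling without replacement

rng = random.Random(0)  # fixed seed so a given day's prompt is reproducible
prompt = idea_prompt(CONCEPTS, rng)
print(", ".join(prompt))
```

Passing the random generator in explicitly makes it easy to seed by date, so the same prompt shows up if you reopen the app during the day.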
Having lived ~19 years, I can distinctly remember around 5~6 times when I explicitly noticed myself experiencing totally new qualia with my inner monologue going “oh wow! I didn’t know this dimension of qualia was a thing.” examples:
hard-to-explain sense that my mind is expanding horizontally with fractal cube-like structures (think bismuth) forming around it and my subjective experience gliding along its surface which lasted for ~5 minutes after taking zolpidem for the first time to sleep (2 days ago)
getting drunk for the first time (half a year ago)
feeling absolutely euphoric after having a cool math insight (a year ago)
...
Reminds me of myself around a decade ago, completely incapable of understanding why my uncle smoked, being “huh? The smoke isn’t even sweet, why would you want to do that?” Now that I have [addiction-to-X] as a clear dimension of qualia/experience solidified in myself, I can better model their subjective experiences although I’ve never smoked myself. Reminds me of the SSC classic.
Also one observation is that it feels like the rate at which I acquire these is getting faster, probably because of increase in self-awareness + increased option space as I reach adulthood (like being able to drink).
Anyways, I think it’s really cool, and can’t wait for more.
I observed new visual qualia of colors while using some light machine.
Also, when I first came to Italy, I had a feeling as if the whole rainbow of color qualia had changed.
Sunlight scattered by the atmosphere on cloudless mornings during the hour before sunrise inspires a subtle feeling (“this is cool, maybe even exciting”) that I never noticed till I started intentionally exposing myself to it for health reasons (specifically, making it easier to fall asleep 18 hours later).
More precisely, I might or might not have noticed the feeling, but if I did notice it, I quickly forgot about it because I had no idea how to reproduce it.
I have to get away from artificial light (streetlamps) (and from direct (yellow) sunlight) for the (blue) indirect sunlight to have this effect. Also, it is no good looking at a small patch of sky, e.g., through a window in a building: most or all of the upper half of my field of vision must be receiving this indirect sunlight. (The intrinsically-photosensitive retinal ganglion cells are all over the bottom half of the retina, but absent from the top half.)
To me, the fact that the human brain basically implements SSL+RL is very very strong evidence that the current DL paradigm (with a bit of “engineering” effort, but nothing like fundamental breakthroughs) will kinda just keep scaling until we reach point-of-no-return. Does this broadly look correct to people here? Would really appreciate other perspectives.
I mostly think “algorithms that involve both SSL and RL” is a much broader space of possible algorithms than you seem to think it is, and thus that there are parts of this broad space that require “fundamental breakthroughs” to access. For example, both AlexNet and differentiable rendering can be used to analyze images via supervised learning with gradient descent. But those two algorithms are very very different from each other! So there’s more to an algorithm than its update rule.
See also 2nd section of this comment, although I was emphasizing alignment-relevant differences there whereas you’re talking about capabilities. Other things include the fact that if I ask you to solve a hard math problem, your brain will be different (different weights, not just different activations / context) when you’re halfway through compared to when you started working on it (a.k.a. online learning, see also here), and the fact that brain neural networks are not really “deep” in the DL sense. Among other things.
Makes sense. I think we’re using the terms differently in scope. By “DL paradigm” I meant to encompass the kind of stuff you mentioned (RL-directing-SS-target (active learning), online learning, different architecture, etc) because they really seemed like “engineering challenges” to me (despite them covering a broad space of algorithms) in the sense that capabilities researchers already seem to be working on & scaling them without facing any apparent blockers to further progress, i.e. in need of any “fundamental breakthroughs”—by which I was pointing more at paradigm shifts away from DL like, idk, symbolic learning.
I have a slightly different takeaway. Yes techniques similar to current techniques will most likely lead to AGI but it’s not literally ‘just scaling LLMs’. The actual architecture of the brain is meaningfully different from what’s being deployed right now. So different in one sense. On the other hand it’s not like the brain does something completely different and proposals that are much closer to the brain architecture are in the literature (I won’t name them here...). It’s plausible that some variant on that will lead to true AGI. Pure hardware scaling obviously increases capabilities in a straightforward way but a transformer is not a generally intelligent agent and won’t be even if scaled many more OOMs.
(I think Steven Byrnes has a similar view but I wouldn’t want to misrepresent his views)
So far as I can tell, a transformer has three possible blockers (that would need to stand undefeated together): (1) in-context learning plateauing at a level where it’s not able to do even a little bit of useful work without changing model weights, (2) terrible sample efficiency that asks for more data than is available on new or rare/situational topics, and (3) absence of a synthetic data generation process that’s both sufficiently prolific and known not to be useless at that scale.
A need for online learning and terrible sample efficiency are defeated by OOMs if enough useful synthetic data can be generated, which the anemic in-context learning without changing weights might turn out to be sufficient for. This is the case of defeating (3), with others falling as a result.
Another possibility is that much larger multimodal transformers (there is a lot of video) might suffice without synthetic data if a model learns superintelligent in-context learning. SSL is not just about imitating humans, the problems it potentially becomes adept at solving are arbitrarily intricate. So even if it can’t grow further and learn substantially new things within its current architecture/model, it might happen to already be far enough along at inference time to do the necessary redesign on its own. This is the case of defeating (1), leaving it to the model to defeat the others. And it should help with (3) even at non-superintelligent levels.
Failing that, RL demonstrates human level sample efficiency in increasingly non-toy settings, promising that saner amounts of useful synthetic data might suffice, defeating (2), though at this point it’s substantially not-a-transformer.
Generating useful synthetic data and solving novel tasks with little correlation with training data is the exact issue here. Seems straightforwardly true that a transformer architecture doesn’t do that?
I don’t know what superintelligent in-context learning is—I’d be skeptical that scaling a transformer a further 3 OOMS will suddenly make it do tasks that are very far from the text distribution it is trained on, indeed solutions to tasks that are not even remotely in the internet text data like building a recursively self-improving agent (if such a thing is possible...)? Maybe I’m misunderstanding what you’re claiming here.
Not saying it’s impossible, just seems deeply implausible. Ofc LLMs being so impressive was also a priori implausible, but this seems another OOM of implausibility bits, if that makes sense?
I’m imagining some prompts to generate reasoning, inferred claims about the world. You can’t generate new observations about the world, but you can reason about the observations available so far, and having those inferred claims in the dataset likely helps; that’s how humans build intuition about theory. If on average 1000 inferred claims are generated for every naturally observed statement (or just those on rare/new/situational topics), that could close the gap of sample efficiency with humans. Might take the form of exercises or essays or something.
If this is all done with prompts, using a sufficiently smart order-following chatbot, then it’s straightforwardly just a transformer, with some superficial scaffolding. If this can work, it’ll eventually appear in distillation literature, though I’m not sure if serious effort to check was actually made with current SOTA LLMs, to pre-train exclusively on synthetic data that’s not too simplistically prompted. Possibly you get nothing for a GPT-3 level generator, and then something for GPT-4+, because reasoning needs to be good enough to preserve contact with ground truth. From Altman’s comments I get the impression that it’s plausibly the exact thing OpenAI is hoping for.
In-context learning is capability to make use of novel data that’s only seen in a context, not in pre-training, to do tasks that make use of this novel data, in ways that normally would’ve been expected to require it being seen in pre-training. In-context learning is a model capability, it’s learned. So its properties are not capped by those of the hardcoded model training algorithm, notably in principle in-context learning could have higher sample efficiency (which might be crucial for generating a lot of synthetic data out of a few rare observations). Right now it’s worse in most respects, but that could change with scale without substantially modifying the transformer architecture, which is the premise of this thread.
By superintelligent in-context learning I mean the capabilities of in-context learning significantly exceeding those of humans. Things like fully comprehending a new paper without changing any model weights, becoming able to immediately write the next one in the same context window. I agree that it’s not very plausible, and probably can’t happen without sufficiently deep circuits, which even deep networks don’t seem to normally develop. But it’s not really ruled out by anything that’s been tried so far. Recent stuff on essentially pre-training with some frozen weights without losing resulting performance suggests a trend of increasing feasible model size for given compute. So I’m not sure this can’t be done in a few years. Then there’s things like memory transformers, handing a lot more data than a context to a learned learning capability.
I wonder if the following would make it possible to study textbooks more efficiently using LLMs:
Feed the entire textbook to the LLM and produce a list of summaries that increases in granularity and length, covering all the material in the textbook just at a different depth (eg proofs omitted, further elaboration on high-level perspectives, etc)
The student starts from the highest-level summary, and gradually moves to the more granular materials.
When I study textbooks, I spend a significant amount of time improving my mental autocompletion, like familiarizing myself with the terminology, which words or proof styles usually come up in which context, etc. Doing this seems to significantly improve my ability to read e.g. long proofs, since I can ignore all the pesky details (which I can trust my mental autocompletion to fill in later if needed) and allocate my effort to getting a high-level view of the proof.
Textbooks don’t really admit this style of learning, because the students don’t have prior knowledge of all the concept-dependencies of a new subject they’re learning, and thus are forced to start at the lowest-level and make their way up to the high-level perspective.
Perhaps LLMs will let us reverse this direction, instead going from the highest to the lowest.
What’s a good technical introduction to Decision Theory and Game Theory for alignment researchers? I’m guessing standard undergrad textbooks don’t include, say, content about logical decision theory. I’ve mostly been reading posts on LW but as with most stuff here they feel more like self-contained blog posts (rather than textbooks that build on top of a common context) so I was wondering if there was anything like a canonical resource providing a unified technical / math-y perspective on the whole subject.
The MIRI Research Guide recommends An Introduction to Decision Theory and Game Theory: An Introduction. I have read neither and am simply relaying the recommendation.
I absolutely hate bureaucracy, dumb forms, stupid websites etc. Like, I almost had a literal breakdown trying to install Minecraft recently (and eventually failed). God.
I think what’s so crushing about it is that it reminds me that the wrong people are designing things, and that they won’t allow them to be fixed, and I can only find solace in thinking that the inefficiency of their designs is also a sign that they can be defeated.
God, I wish real analysis was at least half as elegant as any other math subject; there are way too many pathological examples that I couldn’t care less about. I’ve heard some good things about constructivism though, hopefully analysis is done better there.
Yeah, real analysis sucks. But you have to go through it to get to delightful stuff— I particularly love harmonic and functional analysis. Real analysis is just a bunch of pathological cases and technical persnicketiness that you need to have to keep you from steering over a cliff when you get to the more advanced stuff. I’ve encountered some other subjects that have the same feeling to them. For example, measure-theoretic probability is a dry technical subject that you need to get through before you get the fun of stochastic differential equations. Same with commutative algebra and algebraic geometry, or point-set topology and differential geometry.
Constructivism, in my experience, makes real analysis more mind blowing, but also harder to reason about. My brain uses non-constructive methods subconsciously, so it’s hard for me to notice when I’ve transgressed the rules of constructivism.
As a general reflection on undergraduate mathematics, imho there is way too much emphasis on real analysis. Yes, knowing how to be rigorous is important, being aware of pathological counterexamples is important, and real analysis is used all over the place. But there is so much more to learn in mathematics than real analysis, and the focus on minor technical issues here is often a distraction from developing a broad & deep mathematical background.
For most mathematicians (and scientists using serious math) real analysis is only a small part of the toolkit. Understanding the different kinds of limits well can ofc be crucial in functional analysis, stochastic processes and various parts of physics. But there are so many topics that are important to know and learn here!
The reason it is so prominent in the undergraduate curriculum seems to be more tied to institutional inertia, its prominence on centralized exams, relation with calculus, etc
Really, what’s going on is that in the general case, as mathematics is asked to be more and more general, you will start encountering pathological examples more, and paying attention to detail more is a valuable skill in both math and real life.
And while being technical about the pathological cases is kind of annoying, it’s also one that actually matters in real life, as you aren’t guaranteed to have an elegant solution to your problems.
Update: huh, nonstandard analysis is really cool. Not only are things much more intuitive (by using infinitesimals from hyperreals instead of using epsilon-delta formulation for everything), by the transfer principle all first order statements are equivalent between standard and nonstandard analysis!
There were various notions/frames of optimization floating around, and I tried my best to distill them:
Eliezer’s Measuring Optimization Power on unlikelihood of outcome + agent preference ordering
Alex Flint’s The ground of optimization on robustness of system-as-a-whole evolution
Selection vs Control as distinguishing different types of “space of possibilities”
Selection as having that space explicitly given & selectable numerous times by the agent
Control as having that space only given in terms of counterfactuals, and the agent can access it only once.
These distinctions correlate with the type of algorithm being used & its internal structure, where Selection uses a more search-like process using maps, while Control may just use an explicit formula … although it may very well use internal maps to Select on counterfactual outcomes!
In other words, the Selection vs Control may very well be viewed as a different cluster of Analysis. Example:
If we decide to focus our Analysis of “space of possibilities” on eg “Real life outcome,” then a guided missile is always Control.
But if we decide to focus on “space of internal representation of possibilities,” then a guided missile that uses an internal map to search on becomes Selection.
“Internal Optimization” vs “External Optimization”
Similar to Selection vs Control, but the analysis focuses more on internal structure:
Why? Motivated by the fact that, as with the guided missile example, Control systems can be viewed as Selection systems depending on perspective
… hence, better to focus on internal structures where it’s much less ambiguous.
IO: Internal search + selection
EO: Flint’s definition of “optimizing system”
IO is included in EO, if we assume accurate map-to-environment correspondence.
To me, this doesn’t really get at what the internals of actually-control-like systems look like, which are presumably a subset of EO − IO (the systems in EO that are not in IO).
Search-in-Territory vs Search-in-Map
Greater emphasis on internal structure—specifically, “maps.”
Maps are a capital investment, allowing you to optimize despite not knowing exactly what to optimize for (by compressing info)
I have several thoughts on these framings, but one trouble is the excessive usage of words to represent “clusters” i.e. terms to group a bunch of correlated variables. Selection vs Control, for example, doesn’t have a clear definition/criteria but rather points at a number of correlated things, like internal structure, search, maps, control-like things, etc.
Sure, deconfusing and pointing out clusters is useful because clusters imply correlations and correlations perhaps imply hidden structure + relationships. But I think the costs from cluster-representing words doing hidden inference are much greater than the benefits, and it would be better to explicitly lay out the features-of-clusters that one is referring to instead of just using the name of the cluster.
This is similar to the trouble I had with “wrapper-minds,” which is yet another example of a cluster pointing at a bunch of correlated variables, and people using the same term to mean different things.
Anyways, I still feel totally confused about optimization, and while these clusters/frames are useful, I think thinking in terms of them would cause even more confusion within myself. It’s probably better to take the useful individual parts within the cluster and start deconfusing from the ground up using those as the building blocks.
I find the intersection of computational mechanics, boundaries/frames/factored-sets, and some works from the causal incentives group—especially discovering agents and robust agents learn causal world model (review) - to be a very interesting theoretical direction.
By boundaries, I mean a sustaining/propagating system that informationally/causally insulates its ‘viscera’ from the ‘environment,’ and only allows relatively small amounts of deliberate information flow through certain channels in both directions. Living systems are an example of it (from bacteria to humans). A boundary doesn’t even have to be a physically distinct chunk of spacetime; it can be over more abstract variables like societal norms. Agents are an example of it.
I find them very relevant to alignment especially from the direction of detecting such boundary-possessing/agent-like structures embedded in a large AI system and backing out a sparse relationship between these subsystems, which can then be used to e.g., control the overall dynamic. Check out these posts for more.
A prototypical deliverable would be an algorithm that can detect such ‘boundaries’ embedded in a dynamical system when given access to some representation of the system, performs observations & experiments and returns a summary data structure of all the ‘boundaries’ embedded in a system and their desires/wants, how they game-theoretically relate to one another (sparse causal relevance graph?), the consequences of interventions performed on them, etc—that’s versatile enough to detect e.g., gliders embedded in Game of Life / Particle Lenia, agents playing Minecraft while only given coarse grained access to the physical state of the world, boundary-like things inside LLMs, etc. (I’m inspired by this)
Why do I find the aforementioned directions relevant to this goal?
Critch’s Boundaries operationalizes boundaries/viscera/environment as functions of the underlying variable that execute policies that continuously prevent information ‘flow’ [1] between disallowed channels, quantified via conditional transfer entropy.
Relatedly, Fernando Rosas’s paper on Causal Blankets operationalizes boundaries using a similar but subtly different[2] form of mutual information constraint on the boundaries/viscera/environment variables than that of Critch’s. Importantly, they show that such blankets always exist between two coupled stochastic processes (using a similar style of future morph equivalence relation characterization from compmech), and also introduce a metric they call “synergistic coefficient” that quantifies how boundary-like this thing is.[3]
More on compmech, epsilon transducers generalize epsilon machines to input-output processes. PALO (Perception Action Loops) and Boundaries as two epsilon transducers coupled together?
These directions are interesting, but I find them still unsatisfactory because all of them are purely behavioral accounts of boundaries/agency. One of the hallmarks of agentic behavior (or some boundary behaviors) is adapting one’s policy if an intervention changes the environment in a way that the system can observe and adapt to.[4][5]
(is there an interventionist extension of compmech?)
Discovering agents provides a genuine causal, interventionist account of agency and an algorithm to detect them, motivated by the intentional stance. I think the paper is very enlightening from a conceptual perspective, but there are many problems yet to be solved before we can actually implement this. Here’s my take on it.
More fundamentally, (this is more vibes, I’m really out of my depth here) I feel there is something intrinsically limiting with the use of Bayes Nets, especially with the fact that choosing which variables to use in your Bayes Net already encodes a lot of information about the specific factorization structure of the world. I heard good things about finite factored sets and I’m eager to learn more about them.
Not exactly a ‘flow’, because transfer entropy conflates between intrinsic information flow and synergistic information—a ‘flow’ connotes only the intrinsic component, while transfer entropy just measures the overall amount of information that a system couldn’t have obtained on its own. But anyways, transfer entropy seems like a conceptually correct metric to use.
Specifically, Fernando’s paper criticizes blankets of the following form ($V$ for viscera, $A$ and $P$ for active/passive boundaries, $E$ for environment):
$V_t \to (A_t, P_t) \to E_t$
The data processing inequality (DPI) implies $I(V_t; A_t, P_t) \ge I(V_t; E_t)$
This clearly forbids dependencies formed in the past that stay in ‘memory’.
but Critch instead defines boundaries as satisfying the following two criteria:
$(V_{t+1}, A_{t+1}) \to (V_t, A_t, P_t) \to E_t$ (infiltration)
DPI implies $I(V_t; A_t, P_t) \ge I(V_t; E_t)$
$(E_{t+1}, P_{t+1}) \to (A_t, P_t, E_t) \to V_t$ (exfiltration)
DPI implies $I(V_{t+1}, A_{t+1}; A_t, P_t) \ge I(V_{t+1}, A_{t+1}; E_t)$
and now that the independencies are entangled across different $t$, there is no longer a clear upper bound on $I(V_t; E_t)$, so I don’t think the criticisms apply directly.
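To make the single-timestep inequality concrete, here is a minimal numerical sketch of the data processing inequality on a three-variable Markov chain. It is only an illustration under assumptions not in the papers: the boundary $(A_t, P_t)$ is collapsed into one binary variable `B`, each arrow is a noisy copy, and the flip probabilities are made up.

```python
import math
from itertools import product

def mutual_information(joint):
    """I(X;Y) in bits, given a joint distribution as a dict {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def noisy_copy(x, y, flip):
    """P(y | x) for a binary channel that flips its input with probability `flip`."""
    return 1 - flip if x == y else flip

# Hypothetical chain V -> B -> E (viscera -> boundary -> environment).
p_v = {0: 0.5, 1: 0.5}
flip_vb, flip_be = 0.1, 0.2  # assumed noise levels

joint_vb = {(v, b): p_v[v] * noisy_copy(v, b, flip_vb)
            for v, b in product((0, 1), repeat=2)}

# Marginalize B out of P(V, B, E) to get P(V, E).
joint_ve = {}
for v, b, e in product((0, 1), repeat=3):
    p = p_v[v] * noisy_copy(v, b, flip_vb) * noisy_copy(b, e, flip_be)
    joint_ve[(v, e)] = joint_ve.get((v, e), 0.0) + p

i_vb = mutual_information(joint_vb)
i_ve = mutual_information(joint_ve)
print(f"I(V;B) = {i_vb:.3f} bits, I(V;E) = {i_ve:.3f} bits")
```

In this toy chain the environment can only learn about the viscera through the boundary at the same timestep, which is exactly the restriction that rules out ‘memory’ effects.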
My immediate curiosities are on how these two formalisms relate to one another. e.g., Which independency requirements are more conceptually ‘correct’? Can we extend the future-morph construction to construct Boundaries for Critch’s formalism? etc etc
For example, a rock is very goal-directed relative to ‘blocking-a-pipe-that-happens-to-exactly-match-its-size,’ until one performs an intervention on the pipe size to discover that it can’t adapt at all.
Also, interventions are really cheap to run on digital systems (e.g., LLMs, cellular automata, simulated environments)! Limiting oneself to behavioral accounts of agency would miss out on a rich source of cheap information.
Does anyone know if Shannon arrived at entropy from the axiomatic definition first, or the operational definition first?
I’ve been thinking about these two distinct ways in which we seem to arrive at new mathematical concepts. Looking at the countless partial information decomposition measures in the literature, all derived/motivated on an axiomatic basis, and not knowing which intuition to prioritize over which, I’ve been assigning less of a premium to axiomatic conceptual definitions than I used to:
decision theoretic justification of probability > Cox’s theorem
shannon entropy as min description length > three information axioms
fernando’s operational definition of synergistic information > rest of the literature with its countless non-operational PID measures
The basis of comparison would be its usefulness and ease-of-generalization to better concepts:
at least in the case of fernando’s synergistic information, it seems far more useful because i at least know what i’m exactly getting out of it, unlike having to compare between the axiomatic definitions based on handwavy judgements.
for ease of generalization, the problem with axiomatic definitions is that there are many logically equivalent ways to state the initial axiom (from which they can then be relaxed), and operational motivations seem to ground these equivalent characterizations better, like logical inductors from the decision theoretic view of probability theory
(obviously these two feed into each other)
I’m not sure what you mean by operational vs axiomatic definitions.
But Shannon was unaware of the usage of $S = -\sum_i p_i \ln p_i$ in statistical mechanics. Instead, he was inspired by Nyquist and Hartley’s work, which introduced ad-hoc definitions of information in the case of constant probability distributions.
And in his seminal paper, “A mathematical theory of communication”, he argued in the introduction for the logarithm as a measure of information because of practicality, intuition and mathematical convenience. Moreover, he explicitly derived the entropy of a distribution from three axioms:
1) that it be continuous wrt. the probabilities,
2) that it increase monotonically for larger systems w/ constant probability distributions,
3) and that it be a weighted sum of the entropies of sub-systems.
See section 6 for more details.
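The third axiom is the least obvious of the three, so here is a small numerical check of it, using the decomposition example Shannon gives in that section: a choice among probabilities (1/2, 1/3, 1/6) is the same as a fair binary choice followed, half the time, by a (2/3, 1/3) choice. The function name and the use of base-2 logarithms are my own choices; the axioms only fix entropy up to a positive constant.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Direct three-way choice.
direct = entropy([1/2, 1/3, 1/6])

# Same choice made in two stages: a fair coin first, then (with weight 1/2)
# a second choice between the remaining two outcomes.
two_stage = entropy([1/2, 1/2]) + (1/2) * entropy([2/3, 1/3])

# Axiom 2: for uniform distributions, entropy grows with the number of outcomes.
uniform_entropies = [entropy([1 / n] * n) for n in range(1, 6)]

print(direct, two_stage)
```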
I hope that answers your question.
‘Symmetry’ implies ‘redundant coordinate’ implies ‘cyclic coordinates in your Lagrangian / Hamiltonian’ implies ‘conservation of conjugate momentum’
And because the action principle (where the true system trajectory extremizes your action, i.e. integral of Lagrangian) works in various dynamical systems, the above argument works in non-physical dynamical systems.
Thus conserved quantities usually exist in a given dynamical system.
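The middle links of the chain above (cyclic coordinate implies conserved conjugate momentum) can be written out in two lines. This is the standard textbook derivation, using generalized coordinates $q_i$ and a Lagrangian $L(q, \dot q, t)$ as notation of my choosing:

```latex
\begin{align}
  % Extremizing the action gives the Euler-Lagrange equation for each coordinate q_i
  \frac{d}{dt}\frac{\partial L}{\partial \dot q_i} - \frac{\partial L}{\partial q_i} &= 0 \\
  % If q_k is cyclic (L does not depend on q_k), its conjugate momentum p_k is constant
  \frac{\partial L}{\partial q_k} = 0
  \;\Longrightarrow\;
  \frac{d p_k}{dt} &= \frac{d}{dt}\frac{\partial L}{\partial \dot q_k} = 0,
  \qquad p_k := \frac{\partial L}{\partial \dot q_k}
\end{align}
```

Nothing in these two lines uses the fact that the system is physical, only that its trajectories extremize some action.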
mmm, but why does the action principle hold in such a wide variety of systems though? (like how you get entropy by postulating something to be maximized in an equilibrium setting)
Mildly surprised how some verbs/connectives barely play any role in conversations, even in technical ones. I just tried directed babbling with someone, and (I think?) I learned quite a lot about Israel-Pakistan relations with almost no stress coming from eg needing to make my sentences grammatically correct.
Example of (a small part of) my attempt to summarize my understanding of how Jews migrated in/out of Jerusalem over the course of history:
Could you explain more what you mean by this?
My (completely amateur) understanding is that the “extra” semantic and syntactic structure of written and spoken language does two things.
One, it adds redundancy and reduces error. Simple example, gendered pronouns mean that when you hear “Have you seen Laurence? She didn’t get much sleep last night.” you have a chance to ask the speaker for clarification and catch if they had actually said “Laura” and you misheard.
Two, it can be used as a signal. The correct use of jargon is used by listeners or readers as a proxy for competence. Or many typos in your text will indicate to readers that you haven’t put much effort into what you’re saying.
Why haven’t mosquitos evolved to be less itchy? Is there just not enough selection pressure posed by humans yet? (yes probably) Or are they evolving towards that direction? (they of course already evolved towards being less itchy while biting, but not enough to make that lack-of-itch permanent)
this is a request for help i’ve been trying and failing to catch this one for god knows how long plz halp
tbh would be somewhat content coexisting with them (at the level of houseflies) as long as they evolved the itch and high-pitch noise away, modulo disease risk considerations.
The reason mosquito bites itch is because they are injecting saliva into your skin. Saliva contains mosquito antigens, foreign particles that your body has evolved to attack with an inflammatory immune response that causes itching. The compound histamine is a key signaling molecule used by your body to drive this reaction.
In order for the mosquito to avoid provoking this reaction, they would either have to avoid leaving compounds inside of your body, or mutate those compounds so that they do not provoke an immune response. The human immune system is an adversarial opponent designed with an ability to recognize foreign particles generally. If it was tractable for organisms to reliably evolve to avoid provoking this response, that would represent a fundamental vulnerability in the human immune system.
Mosquito saliva does in fact contain anti-inflammatory, antihemostatic, and immunomodulatory compounds. So they’re trying! But also this means that mosquitos are evolved to put saliva inside of you when they feed, which means they’re inevitably going to expose the foreign particles they produce to your immune system.
There’s also a facet of selection bias making mosquitos appear unsuccessful at making their bites less itchy. If a mosquito did evolve to not provoke (as much of) an immune response and therefore less itching, redness and swelling, you probably wouldn’t notice they’d bitten you. People often perceive that some are prone to getting bitten, others aren’t. It may be that some of this is that some people don’t have as serious an immune response to mosquito bites, so they think they get bitten less often.
I’m sure there are several PhDs worth of research questions to investigate here—I’m a biomedical engineer with a good basic understanding of the immune system, but I don’t study mosquitos.
Because they have no reproductive advantage to being less itchy. You can kill them while they’re feeding, which is why they put lots of evolutionary effort into not being noticed. (They have an anesthetic in their saliva so you are unlikely to notice the bite.) By the time you develop the itchy bump, they’ve flown away and you can’t kill them.
There’s still some pressure, though. If the bites were permanently not itchy, then I may not have noticed that the mosquitos were in my room in the first place, and consequently would be less likely to pursue them directly. I guess that’s just not enough.
There’s also positive selection for itchiness. Mosquito spit contains dozens of carefully evolved proteins. We don’t know what they all are, but some of them are anticoagulants and anesthetics. Presumably they wouldn’t be there if they didn’t have a purpose. And your body, when it detects these foreign proteins, mounts a protective reaction, causing redness, swelling, and itching. IIRC, that reaction does a good job of killing any viruses that came in with the mosquito saliva. We’ve evolved to have that reaction. The itchiness is probably good for killing any bloodsuckers that don’t flee quickly. It certainly works against ticks.
Evolution is not our friend. It doesn’t give us what we want, just what we need.
I believe mosquitos do inject something to suppress your reaction to them, which is why you don’t notice bug bites until long after the bug is gone. There’s no reproductive advantage to the mosquito to extending that indefinitely.
Oh wow, that would make a ton of sense. Thanks Elizabeth!
I had something like locality in mind when writing this shortform, the context being: [I’m in my room → I notice itch → I realize there’s a mosquito somewhere in my room → I deliberately pursue and kill the mosquito that I wouldn’t have known existed without the itch]
But, again, this probably wouldn’t amount to much selection pressure, partially due to the fact that the vast majority of the mosquito population exists in places where such locality doesn’t hold, i.e. in an open environment.
In NZ we have biting bugs called sandflies which don’t do this—you can often tell the moment they get you.
The reason you find them itchy is because humans are selected to find them itchy most likely?
But the evolutionary timescale at which mosquitos can adapt to avoid detection must be faster than that of humans adapting to find mosquitos itchy! Or so I thought—my current boring guess is that (1) mechanisms for the human body to detect foreign particles are fairly “broad”, (2) the required adaptation from the mosquitos to evade them are not-way-too-simple, and (3) we just haven’t put enough selection pressure to make such change happen.
Yeah that would be my thinking as well.
Just noticing that the negation of a statement exists is enough to make meaningful updates.
e.g. I used to (implicitly) think “Chatbot Romance is weird” without having evaluated anything in-depth about the subject (and consequently didn’t have any strong opinions about it)—probably as a result of some underlying cached belief.
But after seeing this post, just reading the title was enough to make me go (1) “Oh! I just realized it is perfectly possible to argue in favor of Chatbot Romance … my belief on this subject must be a cached belief!” (2) hence is probably by-default biased towards something like the consensus opinion, and (3) so I should update away from my current direction, even without reading the post.
(Note: This was a post, but in retrospect it probably would have been better posted as a shortform)
(Epistemic Status: 20 minutes’ worth of thinking, haven’t done any builder/breaker on this yet although I plan to, and would welcome any attempts in the comments)
Have an algorithmic task whose input/output pair could (in reasonable algorithmic complexity) be generated using highly specific combination of modular components (e.g., basic arithmetic, combination of random NN module outputs, etc).
Train a small transformer (or anything, really) on the input/output pairs.
Take a large transformer that takes the activation/weights, and outputs a computational graph.
Train that large transformer over the small transformer, across a diverse set of such algorithmic tasks (probably automatically generated) with varying complexity. Now you have a general tool that takes in a set of high-dimensional matrices and backs-out a simple computational graph, great! Let’s call it Inspector.
Apply the Inspector in real models and see if it recovers anything we might expect (like induction heads).
To go a step further, apply the Inspector to itself. Maybe we might back-out a human implementable general solution for mechanistic interpretability! (Or, at least let us build a better intuition towards the solution.)
(This probably won’t work, or at least isn’t as simple as described above. Again, welcome any builder/breaker attempts!)
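As a starting point for the first step, here is a rough sketch of what an automatically generated algorithmic task with a known ground-truth computational graph could look like. Everything here is a hypothetical illustration: the component library (`COMPONENTS`), the graph encoding, and the function names are made up, and real tasks would presumably use richer modules than integer arithmetic.

```python
import operator
import random

# Hypothetical library of modular components (an assumption for illustration).
COMPONENTS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def sample_graph(num_inputs, num_nodes, rng):
    """Sample a random computational graph as a list of (op, left, right).

    Indices 0..num_inputs-1 refer to inputs; node k gets index num_inputs + k
    and may read any input or any earlier node.
    """
    graph = []
    for k in range(num_nodes):
        available = num_inputs + k
        op = rng.choice(sorted(COMPONENTS))
        graph.append((op, rng.randrange(available), rng.randrange(available)))
    return graph

def evaluate(graph, inputs):
    """Run the graph on one input vector; the last node is the output."""
    values = list(inputs)
    for op, left, right in graph:
        values.append(COMPONENTS[op](values[left], values[right]))
    return values[-1]

def make_dataset(graph, num_inputs, num_examples, rng):
    """Input/output pairs for training the small model; the graph is the label
    that the Inspector would later be trained to recover."""
    data = []
    for _ in range(num_examples):
        x = [rng.randint(-5, 5) for _ in range(num_inputs)]
        data.append((x, evaluate(graph, x)))
    return data

rng = random.Random(0)
graph = sample_graph(num_inputs=3, num_nodes=4, rng=rng)
dataset = make_dataset(graph, num_inputs=3, num_examples=100, rng=rng)
```

Varying `num_nodes` gives a crude complexity knob for the "varying complexity" part of the proposal.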
People mean different things when they say “values” (object vs meta values)
I noticed that people often mean different things when they say “values,” and they end up talking past each other (or convergence only happens after a long discussion). One of the differences is in whether they contain meta-level values.
Some people refer to the “object-level” preferences that we hold.
Often people bring up the “beauty” of the human mind’s capacity for its values to change, evolve, adopt, and grow—changing mind as it learns more about the world, being open to persuasion via rational argumentation, changing moral theories, etc.
Some people include the meta-values (that are defined on top of other values, and the evolution of such values).
e.g., My “values” include my meta-values, like wanting to be persuaded by good arguments, wanting to change my moral theories when I get to know better, even “not wanting my values to be fixed”
example of this view: carado’s post on you want what you want, and one of Vanessa Kosoy’s shortforms/comments (can’t remember the link)
Is there a way to convert a LessWrong sequence into a single pdf? Should ideally preserve comments, latex, footnotes, etc.
The way I do this is to use the Print as PDF functionality in the browser on every single post, and then concatenate them using pdfunite.
I don’t know if this is just me, but it took me an embarrassingly long time in my mathematical education to realize that the following three terminologies, which introductory textbooks used interchangeably without being explicit, mean the same thing. (Maybe this is just because English is my second language?)
For some reason the “only if” always throws me off. It reminds me of the unless keyword in ruby, which is equivalent to if not, but somehow always made my brain segfault.
Saying “if X then Y” generally is equivalent to “X is sufficient for Y”, “Y is necessary for X”, “X only if Y”.
I think the interchangeability is just hard to understand. Even though I know they are the same thing, it is still really hard to intuitively see them as being equal. I personally try (but not very hard) to stick with X → Y in mathy discussions and if/only if for normal discussions
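One way to convince yourself of the equivalence is to write each phrasing the way it intuitively reads and compare truth tables. The four encodings below are my own readings of the phrases (an assumption, not a standard), and the check just confirms they agree on every row.

```python
from itertools import product

def implies(p, q):
    """Material implication: false only when p is true and q is false."""
    return (not p) or q

rows = []
for x, y in product([False, True], repeat=2):
    if_x_then_y = implies(x, y)
    x_sufficient_for_y = y if x else True        # having X is enough to get Y
    y_necessary_for_x = implies(not y, not x)    # without Y, you cannot have X
    x_only_if_y = not (x and not y)              # X never happens without Y
    rows.append((x, y, if_x_then_y, x_sufficient_for_y,
                 y_necessary_for_x, x_only_if_y))

for row in rows:
    print(row)
```

The only row where any of them is false is X true, Y false, which is the one case all four phrasings forbid.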
Unidimensional Continuity of Preference ≈ Assumption of “Resources”?
tl;dr, the unidimensional continuity of preference assumption in the money pumping argument used to justify the VNM axioms corresponds to the assumption that there exists some unidimensional “resource” that the agent cares about, and this language is provided by the notion of “souring / sweetening” a lottery.
Various coherence theorems (or more specifically, various money pumping arguments) generally have the following form:
… where “resources” (the usual example is money) are something that, apparently, these theorems assume exist. They do, but this fact is often stated in a very implicit way. Let me explain.
In the process of justifying the VNM axioms using money pumping arguments, the three main mathematical primitives are: (1) lotteries (probability distributions over outcomes), (2) a preference relation (a general binary relation), and (3) a notion of souring/sweetening of a lottery. Let me explain what (3) means.
A souring of $A$ is denoted $A^-$, and a sweetening of $A$ is denoted $A^+$.
$A^-$ is to be interpreted as “basically identical with $A$ but strictly inferior in a single dimension that the agent cares about.” Based on this interpretation, we assume $A > A^-$. Sweetening is the opposite, defined in the obvious way.
Formally, souring could be thought of as introducing a new preference relation $A >_{\text{uni}} B$, which is to be interpreted as “lottery $B$ is basically identical to lottery $A$, but strictly inferior in a single dimension that the agent cares about”.
On the syntactic level, such $B$ is denoted as $A^-$.
On the semantic level, based on the above interpretation, $>_{\text{uni}}$ is related to $>$ via the following: $A >_{\text{uni}} B \implies A > B$
This is where the language to talk about resources comes from. “Something you can independently vary alongside a lottery $A$ such that more of it makes you prefer that option compared to $A$ alone” sounds like what we’d intuitively call a resource[1].
Now that we have the language, notice that so far we haven’t assumed sourings or sweetenings exist. The following assumption does it:
Which gives a more operational characterization of souring as something that lets us interpolate between the preference margins of two lotteries—intuitively satisfied by e.g., money due to its infinite divisibility.
So the above assumption is where the assumption of resources comes into play. I’m not aware of any money pump arguments for this assumption, or more generally, for the existence of a “resource.” Plausibly instrumental convergence.
I don’t actually think this + the assumption below fully capture what we intuitively mean by “resources”, enough to justify this terminology. I stuck with “resources” anyways because others around here used that term to (I think?) refer to what I’m describing here.
Having thought about this for some time, my feeling is that resources are about fungibility, implicitly embedded in a context of trade among multiple agents (very broadly construed; e.g., an agent in time can perhaps be thought of as multiple agents cooperating intertemporally).
A resource over time has the property that I can spend it now or I can spend it later. Glibly, one could say the operational meaning of the resource arises from the intertemporal bargaining of the agent.
Perhaps it’s useful to distinguish several levels of resources and resource-like quantities.
Discrete vs continuous, tradeable / meaningful to different agents, ?? Fungibility, ?? Temporal and spatial locatedness, ?? Additivity?, submodularity ?
Addendum: another thing to consider is that the input of the vNM theorem is in some sense more complicated than the output. The output is just a utility function $u: X \to \mathbb{R}$, while your input is a preference order on the very infinite set of lotteries (= probability distributions) $L(X)$.
Thinking operationally about a preference ordering on a space of distributions is a little wacky. It means you are willing to trade off uncertain options against one another. For this to be a meaningful choice would seem to necessitate some sort of (probabilistic) world model.
Damn, why did Pearl recommend readers (in the preface of his causality book) to read all the chapters other than chapter 2 (and the last review chapter)? Chapter 2 is literally the coolest part—inferring causal structure from purely observational data! Almost skipped that chapter because of it …
it’s true it’s cool, but I suspect he’s been a bit disheartened by how complicated it’s been to get this to work in real-world settings.
in the book of why, he basically now says it’s impossible to learn causality from data, which is a bit of a confusing message if you come from his previous books.
but now with language models, I think his hopes are up again, since models can basically piggy-back on causal relationships inferred by humans
You should also check out Timeless Causality, if you haven’t done so already.
Bayes Net inference algorithms maintain their efficiency by using dynamic programming over multiple layers.
Level 0: Naive Marginalization
No dynamic programming whatsoever. Just multiply all the conditional probability distribution (CPD) tables, and sum over the variables of non-interest.
Level 1: Variable Elimination
Cache the repeated computations within a query.
For example, given a chain-structured Bayes Net $A \to B \to C \to D$, instead of doing $P(D)=\sum_A\sum_B\sum_C P(A,B,C,D)$, we can do $P(D)=\sum_C P(D|C)\sum_B P(C|B)\sum_A P(A)P(B|A)$. Check my post for more.
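Here is a minimal sketch contrasting Level 0 and Level 1 on that chain. The CPD numbers are made up and all variables are assumed binary; the point is only that pushing the sums inward gives the same answer while touching each table once.

```python
from itertools import product

# Hypothetical CPD tables for a binary chain A -> B -> C -> D.
p_a = [0.6, 0.4]
p_b_given_a = [[0.7, 0.3], [0.2, 0.8]]  # p_b_given_a[a][b] = P(B=b | A=a)
p_c_given_b = [[0.9, 0.1], [0.4, 0.6]]
p_d_given_c = [[0.5, 0.5], [0.1, 0.9]]

def naive_marginal_d():
    """Level 0: build every entry of the full joint, then sum out A, B, C."""
    p_d = [0.0, 0.0]
    for a, b, c, d in product(range(2), repeat=4):
        p_d[d] += (p_a[a] * p_b_given_a[a][b]
                   * p_c_given_b[b][c] * p_d_given_c[c][d])
    return p_d

def eliminate(message, cpd):
    """Sum out the parent: new[y] = sum_x message[x] * P(y | x)."""
    return [sum(message[x] * cpd[x][y] for x in range(len(message)))
            for y in range(len(cpd[0]))]

def variable_elimination_d():
    """Level 1: eliminate A, then B, then C, caching each intermediate factor."""
    p_b = eliminate(p_a, p_b_given_a)   # sum_A P(A) P(B|A)
    p_c = eliminate(p_b, p_c_given_b)   # sum_B P(B) P(C|B)
    return eliminate(p_c, p_d_given_c)  # sum_C P(C) P(D|C)

print(naive_marginal_d(), variable_elimination_d())
```

For a chain of n variables with k values each, the naive sum visits k^n joint entries, while the eliminated version does roughly n·k² multiplications.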
Level 2: Clique-tree based algorithms — e.g., Sum-product (SP) / Belief-update (BU) calibration algorithms
Cache the repeated computations across queries.
Suppose you have a fixed Bayes Net, and you want to compute not only the marginal P(D), but also P(A). Clearly running two instances of Variable Elimination as above is going to contain some overlapping computation.
Clique-tree is a data structure where, given the initial factors (in this case the CPD tables), you “calibrate” a tree whose nodes correspond to a subset of the variables. Cost can be amortized by running many queries over the same Bayes Net.
Calibration can be done by just two passes across the tree, after which you have the joint marginals for all the nodes of the clique tree.
Incorporating evidence is equally simple. Just zero-out the entries of variables that you are conditioning on for some node, then “propagate” that information downwards via a single pass across the tree.
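To give a feel for the two-pass idea, here is a sketch for the special case where the network is a chain, so the clique tree is itself a chain and the messages live on single variables. This is a simplification I am assuming for brevity (a real clique-tree implementation passes factors between cliques over sepsets), and the CPD numbers and the function name `calibrate_chain` are made up.

```python
# Hypothetical binary chain A -> B -> C -> D.
p_a = [0.6, 0.4]
cpds = [
    [[0.7, 0.3], [0.2, 0.8]],  # P(B | A)
    [[0.9, 0.1], [0.4, 0.6]],  # P(C | B)
    [[0.5, 0.5], [0.1, 0.9]],  # P(D | C)
]

def calibrate_chain(prior, cpds, evidence_on_last=None):
    """Return the (posterior) marginal of every variable using two passes."""
    n = len(cpds) + 1

    # Forward pass: forward[i][x] = P(X_i = x), ignoring evidence.
    forward = [list(prior)]
    for cpd in cpds:
        prev = forward[-1]
        forward.append([sum(prev[x] * cpd[x][y] for x in range(len(prev)))
                        for y in range(len(cpd[0]))])

    # Backward pass: backward[i][x] = P(evidence | X_i = x).
    last_size = len(cpds[-1][0])
    if evidence_on_last is None:
        last_message = [1.0] * last_size
    else:
        # Zero out every entry inconsistent with the observed value.
        last_message = [1.0 if y == evidence_on_last else 0.0
                        for y in range(last_size)]
    backward = [None] * n
    backward[-1] = last_message
    for i in range(n - 2, -1, -1):
        cpd, nxt = cpds[i], backward[i + 1]
        backward[i] = [sum(cpd[x][y] * nxt[y] for y in range(len(nxt)))
                       for x in range(len(cpd))]

    # Combine both messages at every node and normalize.
    marginals = []
    for f, b in zip(forward, backward):
        unnormalized = [fi * bi for fi, bi in zip(f, b)]
        z = sum(unnormalized)
        marginals.append([u / z for u in unnormalized])
    return marginals

prior_marginals = calibrate_chain(p_a, cpds)
posterior_given_d1 = calibrate_chain(p_a, cpds, evidence_on_last=1)
```

After the two passes, every variable's marginal is available at once, which is the amortization that running Variable Elimination once per query does not give you.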
Level 3: Specialized query-set answering algorithms over a calibrated clique tree.
Cache the repeated computations across a certain query-class
e.g., computing $P(X,Y)$ for every pair of variables can be done by using yet another layer of dynamic programming, maintaining a table of $P(C_i, C_j)$ for each pair of clique-tree nodes, ordered according to the distance between them.
Man, deviation arguments are so cool:
what are macrostates? Variables which are required to make your thermodynamics theory work! If they don’t, add more macrostates!
nonequilibrium? Define it as systems that don’t admit a thermodynamic description!
inductive biases? Define it as the amount of correction needed for a system to obey Bayesian updating, i.e. correction terms in the exponent of the Gibbs measure!
coarse graining? Define the coarse-grained variables to keep the dynamics as close as possible to that of the micro-dynamics!
or in a similar spirit—does your biological system deviate from expected utility theory? Well, there’s discovery (and money) to be made!
It’s easy to get confused and think the circularity is a problem (“how can you define thermodynamics in terms of equilibriums, when equilibriums are defined using thermodynamics?”), but it’s all about carving nature at the right joints. A sign that you made the right carving is that the corrections that need to be applied aren’t too numerous, and they all seem “natural” (and of course, all of this while letting you make nontrivial predictions. that’s what matters at the end of the day).
Then, it’s often the case that those corrections also turn out to be meaningful and natural quantities of interest.
One of the rare insightful lessons from high school: Don’t set your AC to the minimum temperature even if it’s really hot, just set it to where you want it to be.
It’s not like the air released gets colder with lower target temperature, because most ACs (according to my teacher, I haven’t checked lol) are just a simple control system that turns itself on/off around the target temperature, meaning the time it takes to reach a certain temperature X is independent of the target temperature (as long as it’s lower than X)
… which is embarrassingly obvious in hindsight.
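As a sanity check, here is a minimal sketch of that argument under a toy room model that is purely an assumption (not from the teacher or any AC manual): heat leaks in from outside at a rate proportional to the temperature difference, and the AC removes heat at a fixed rate whenever the room is above the target. The function name `minutes_to_reach` and all constants are made up for illustration.

```python
def minutes_to_reach(x, target, start=30.0, outside=35.0,
                     leak=0.01, cooling=0.5, dt=0.1, max_minutes=600.0):
    """Simulate an on/off AC; return minutes until the room first drops to x.

    Hypothetical toy model: temperatures in deg C, `leak` in 1/min,
    `cooling` in deg C per minute while the compressor is on.
    """
    temp = start
    steps = 0
    while temp > x:
        if steps * dt >= max_minutes:
            raise RuntimeError("room never reached x (is target above x?)")
        ac_on = temp > target  # bang-bang rule: on above target, off below
        rate = leak * (outside - temp) - (cooling if ac_on else 0.0)
        temp += rate * dt      # forward Euler step
        steps += 1
    return steps * dt


# Same room, same goal temperature X = 24, two different thermostat settings.
print(minutes_to_reach(24.0, target=16.0))
print(minutes_to_reach(24.0, target=22.0))
```

Both calls print the same number: as long as the target is below X, the compressor is simply on the whole time until the room reaches X, so the setting never enters the calculation.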
Well, he is right about some ACs being simple on/off units.
But there also exist units that can change cycle speed; it’s basically the same thing, except the motor driving the compression cycle can vary in speed.
In case you were wondering, they are called inverters. And when buying new today, you really should get an inverter (efficiency).
Quick thoughts on my plans:
I want to focus on having a better mechanistic picture of agent value formation & distinguishing between hypotheses (e.g., shard theory, Thane Ruthenis’s value-compilation hypothesis, etc) and forming my own.
I think I have a specific but very high uncertainty baseline model of what-to-expect from agent value-formation using greedy search optimization. It’s probably time to allocate more resources to reducing that uncertainty by touching reality, i.e. running experiments.
(and also think about related theoretical arguments like Selection Theorem)
So I’ll probably allocate my research time:
Studying math (more linear algebra / dynamical systems / causal inference / statistical mechanics)
Sketching a better picture of agent development, assigning confidence, proposing high-bit experiments (that might have the side-effect of distinguishing between different conflicting pictures), formalization, etc.
and read relevant literature (eg ones on theoretic DL and inductive biases)
Upskilling mechanistic interpretability to actually start running quick experiments
Unguided research brainstorming (e.g., going through various alignment exercises, having a writeup of random related ideas, etc)
Possibly participate in programs like MATS? Probably the biggest benefit to me would be (1) commitment mechanism / additional motivation and (2) high-value conversations with other researchers.
Dunno, sounds pretty reasonable!
A useful perspective when thinking of mechanistic pictures of agent/value development is to take the “perspective” of different optimizers, consider their relative “power,” and how they interact with each other.
E.g., early on SGD is the dominant optimizer, which is greedy and has direct access to feedback from U. Later on, an early proto-GPS (general-purpose search) forms, which is less greedy but can still largely be swayed by SGD (such as having its problem-specification input tweaked, having the overall GPS implementation modified, etc). Much later, GPS becomes the dominant optimizing force “at run-time,” which shortens the relevant time-scale, and we can ignore SGD’s effect. This effect becomes much more pronounced after reflectivity + gradient hacking, when the GPS’s optimization target becomes fixed.
(very much inspired by reading Thane Ruthenis’s value formation post)
This is a very useful approximation at the late-stage when the GPS self-modifies the agent in pursuit of its objective! Rather than having to meticulously think about local SGD gradient incentives and such, since GPS is non-greedy, we can directly model it as doing what’s obviously rational from a birds-eye-perspective.
(kinda similar to e.g., separation of timescale when analyzing dynamical systems)
It seems like retrieval-based transformers like RETRO are “obviously” the way to go: (1) there’s just no need to store all the factual information as fixed weights, and (2) they use far fewer parameters / much less memory. Maybe mechanistic interpretability should start paying more attention to these types of architectures, especially since they’re probably going to be a more relevant form of architecture.
They might also be easier to interpret thanks to specialization!
I’ve noticed during my alignment study that just the sheer amount of relevant posts out there is giving me a pretty bad habit of (1) passively engaging with the material and (2) not doing much independent thinking. Just keeping up to date & distilling the stuff in my todo read list takes up most of my time.
I guess the reason I do it is because (at least for me) it takes a ton of mental effort to switch modes between “passive consumption” and “active thinking”:
I noticed this when self-studying math; like, my subjective experience is that I enjoy both “passively listening to lectures + taking notes” and “solving practice problems,” the problem is that it takes a ton of mental energy to switch between the two equilibriums.
(This is actually still a problem—too much wide & passive consumption rather than actively practicing them and solving problems.)
Also relevant is wanting to just progress/upskill as fast as I can across as wide a range of subjects as I can, sacrificing mastery for diversity. This probably makes sense to some degree (especially in the sense that having more frames is good), but I think I’m taking this wayyyy too far.
My r for opening new links far exceeds 1. This definitely helped me when I was trying to get a rapid overview of the entire field, but now it’s just a bad adaptation + akrasia.
Okay, then, don’t do that! Some directions to move towards:
Independent brainstorming/investigation sessions to form concrete inside views
like the advice from the field guide post, exercises from the MATS model, etc.
Commitment mechanisms, like making regular posts or shortforms (eg)
There are lots of posts but the actual content is very thin. I would say there is plausibly more content in your real analysis book than there is in the entire alignment field.
Is there a case for AI gain-of-function research?
(Epistemic Status: I don’t endorse this yet, just thinking aloud. Please let me know if you want to act/research based on this idea)
It seems like it should be possible to materialize certain forms of AI alignment failure modes with today’s deep learning algorithms, if we directly optimize for their discovery. For example, training a Gradient Hacker Enzyme.
A possible benefit of this would be that it gives us bits of evidence wrt how such hypothesized risks would actually manifest in real training environments. While the similarities would be limited because the training setups would be optimizing for their discovery, it should at least serve as a good lower bound for the scenarios in which these risks could manifest.
Perhaps having a concrete bound for when dangerous capabilities appear (eg an X-parameter model trained in Y modality has Z chance of forming a gradient hacker) would make it easier for policy folks to push for regulations.
Is AI gain-of-function as dangerous as biotech gain-of-function? Some arguments in favor (of the former being dangerous):
The malicious actor argument is probably stronger for AI gain-of-function.
if someone publicly releases a Gradient Hacker Enzyme, this lowers the resources that would be needed for a malicious actor to develop a misaligned AI (eg plug the misaligned Enzyme into an otherwise benign low-capability training run).
Risky researcher incentive is equally strong.
e.g., a research lab carelessly pursuing gain-of-function research, deliberately starting risky training runs for financial/academic incentives, etc.
Some arguments against:
Accident risks from financial incentives are probably weaker for AI gain-of-function.
The standard gain-of-function risk scenario is: research lab engineers a dangerous pathogen, it accidentally leaks, and a pandemic happens.
I don’t see how these events would happen “accidentally” when dealing with AI programs; e.g., the researcher would have to deliberately cut out parts of the network weights and replace them with the enzyme, which is certainly intentional.
Random alignment-related idea: train and investigate a “Gradient Hacker Enzyme”
TL;DR: use meta-learning methods like MAML to train a network submodule (i.e., a circuit) that would resist gradient updates in a wide variety of contexts (various architectures, hyperparameters, modalities, etc), and use mechanistic interpretability to see how it works.
It should be possible to have a training setup for goals other than “resist gradient updates,” such as restricting the meta-objective to a specific sub-sub-circuit. In that case, the outer circuit might (1) instrumentally resist updates, or (2) somehow get modified while keeping its original behavioral objective intact.
This setup doesn’t have to be restricted to circuits of course; there was previous work which did this on the level of activations, although iiuc the model found a trivial solution by exploiting ReLU. It would be interesting to extend this to more diverse setups.
Anyways, varying this “sub-sub-circuit/activation-to-be-preserved” over different meta-learning episodes would incentivize the training process to find “general” Gradient Hacker designs that aren’t specific to a particular circuit/activation—a potential precursor for various forms of advanced Gradient Hackers (and some loose analogies to how enzymes accelerate reactions).
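On the ReLU remark above: the exact trivial solution is not described here, so the following is only an illustration of the general flavor and not a reconstruction of that work. One standard way a ReLU unit becomes immune to gradient updates is by being "dead": if its pre-activation is negative on every input, the gradient with respect to its own weight and bias is exactly zero, so SGD never moves them. The helper name `relu_unit_grads` and all numbers are hypothetical.

```python
import numpy as np


def relu_unit_grads(w, b, x, upstream):
    """Gradients of a scalar loss w.r.t. (w, b) for the unit y = relu(w*x + b).

    `upstream` is dLoss/dy for each sample (an arbitrary stand-in here).
    """
    pre = w * x + b
    gate = (pre > 0).astype(float)        # derivative of ReLU at pre
    grad_w = np.sum(upstream * gate * x)  # chain rule through the gate
    grad_b = np.sum(upstream * gate)
    return grad_w, grad_b


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1000)     # bounded inputs (assumption)
upstream = rng.normal(size=1000)          # made-up loss gradients

# Bias so negative that w*x + b < 0 for every x in [-1, 1]: a dead unit.
dead = relu_unit_grads(w=0.5, b=-10.0, x=x, upstream=upstream)
# Same weight with a small bias: the unit fires on some inputs.
live = relu_unit_grads(w=0.5, b=0.1, x=x, upstream=upstream)

print("dead unit grads:", dead)
print("live unit grads:", live)
```

The dead unit trivially "resists gradient updates," but only by never doing anything, which is why a solution like this is uninteresting as a Gradient Hacker and why more diverse setups would be needed to rule it out.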
What is the Theory of Impact for training a “Gradient Hacker Enzyme”?
(note: while I think these are valid, they’re generated post-hoc and don’t reflect the actual process for me coming up with this idea)
Estimating the lower-bound for the emergence of Gradient Hackers.
By varying the meta-learning setups we can get an empirical estimate for the conditions in which Gradient Hackers are possible.
Perhaps gradient hackers are actually trivial to construct using tricks we haven’t thought of before (like the relu example before). Maybe not! Perhaps they require [high-model-complexity/certain-modality/reflective-agent/etc].
Why lower-bound? In a real training environment, gradient hackers appear because of (presumably) convergent training incentives. Instead in the meta-learning setup, we’re directly optimizing for gradient hackers.
Mechanistically understanding how Gradient Hackers work.
Applying mechanistic interpretability here might not be too difficult, because the circuit is cleanly separated from the rest of the model.
There have been several speculations on how such circuits might emerge. Testing them empirically sounds like a good idea!
This is just a random idea and I’m probably not going to work on it; but if you’re interested, let me know. While I don’t think this is capabilities-relevant, this probably falls under AI gain-of-function research and should be done with caution.
Update: I’m trying to upskill mechanistic interpretability, and training a Gradient Hacker Enzyme seems like a fairly good project just to get myself started.
I don’t think this project would be highly valuable in and of itself (although I would definitely learn a lot!), so one failure mode I need to avoid is ending up investing too much of my time in this idea. I’ll probably spend a total of ~1 week working on it.