I want to go a bit deep here on “maximum entropy” and misunderstandings thereof by the straw-man Humbali character, mostly to clarify things for myself, but also in the hopes that others might find it useful. I make no claim to novelty here—I think all this ground was covered by Jaynes (1968)—but I do have a sense that this perspective (and the measure-theoretic intuition behind it) is not pervasive around here, the way Bayesian updating is.

First, I want to point out that entropy of a probability measure p is only definable relative to a base measure μ, as follows:

$$H_\mu(p) = -\int_X \frac{dp}{d\mu}(x)\,\log\frac{dp}{d\mu}(x)\,d\mu(x)$$

(The derivatives notated here are Radon–Nikodym derivatives; the integral is a Lebesgue integral.) Shannon’s formulae, the discrete $H(p) = -\sum_i p(x_i)\log p(x_i)$ and the continuous $H(p) = -\int_X p(x)\log p(x)\,dx$, are the special cases of this where $\mu$ is taken to be counting measure or Lebesgue measure, respectively. These formulae actually treat $p$ as having a subtly different type than “probability measure”: namely, they treat it as a density with respect to counting measure (a “probability mass function”) or a density with respect to Lebesgue measure (a “probability density function”), and implicitly supply the corresponding $\mu$.

If you’re familiar with Kullback–Leibler divergence ($D_{\mathrm{KL}}$), and especially if you’ve heard $D_{\mathrm{KL}}$ called “relative entropy,” you may have already surmised that $H_\mu(p) = -D_{\mathrm{KL}}(p\,\|\,\mu)$. Usually, KL divergence is defined with both arguments being probability measures (measures that add up to 1), but that’s not required for it to be well-defined (what is required is absolute continuity, which is sort of orthogonal). The principle of “maximum entropy,” or $\arg\max_p H_\mu(p)$, is equivalent to $\arg\min_p D_{\mathrm{KL}}(p\,\|\,\mu)$. In the absence of additional constraints on $p$, the solution of this is $p = \mu$, so maximum entropy makes sense as a rule for minimum confidence to exactly the same extent that the implicit base measure $\mu$ makes sense as a prior. The principle of maximum entropy should really be called “the principle of minimum updating”: make the minimum-KL-divergence move from your prior $\mu$ to your posterior $p$, subject to the posterior exactly agreeing with observed facts. (Standard Bayesian updating can be derived as a special case of this.)
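This identity is easy to check numerically. Here is a minimal pure-Python sketch (the specific distributions and base measures are made up for illustration) showing that, absent further constraints, the entropy maximizer relative to $\mu$ is $\mu$ itself, normalized:

```python
import math

def kl(p, mu):
    """D_KL(p || mu) for discrete distributions given as lists of weights.

    mu need not sum to 1 (it may be an arbitrary base measure), but p must
    be absolutely continuous w.r.t. mu (p[i] > 0 only where mu[i] > 0).
    """
    return sum(pi * math.log(pi / mi) for pi, mi in zip(p, mu) if pi > 0)

# Base measure: counting measure on 4 outcomes (each point has mass 1).
mu = [1.0, 1.0, 1.0, 1.0]

# Relative entropy H_mu(p) = -D_KL(p || mu). For counting measure this
# reduces to Shannon entropy, which is maximized by the uniform distribution.
uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
assert -kl(uniform, mu) > -kl(skewed, mu)  # uniform has more entropy

# With a non-uniform base measure, the unconstrained entropy maximizer is
# the normalized base measure itself, not the uniform distribution.
mu2 = [8.0, 1.0, 1.0, 2.0]
p_mu = [m / sum(mu2) for m in mu2]
assert -kl(p_mu, mu2) > -kl(uniform, mu2)
```

The last assertion is the point: once the base measure stops being uniform, “maximum entropy” stops meaning “spread out evenly” and starts meaning “stay as close to $\mu$ as the constraints allow.”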

Sometimes, the structure of a situation has some symmetry group with respect to which the situation of uncertainty seems to be invariant, with classic examples being relabeling heads/tails on a coin, or arbitrarily permuting a shuffled deck of cards. In these examples, the requirement that a prior be invariant with respect to those symmetries (in Jaynes’ terms, the principle of transformation groups) uniquely characterizes counting measure as the only consistent prior (the classical principle of indifference, which still lies at the core of grade-school probability theory). In other cases, like a continuous roulette wheel, other Haar measures (which generalize both counting and Lebesgue measure) are justified. But taking “indifference” or “insufficient reason” to justify using an invariant measure as a prior in an arbitrary situation (as Laplace apparently did) is fraught with difficulties:

Most obviously, the invariant measure on $[-\infty, \infty]$ with respect to translations, namely Lebesgue measure, is an improper prior: it is a non-probability measure because its integral (formally, $\int_{-\infty}^{\infty} 1\,d\mu(x)$) is infinite. If we’re talking about forecasting the timing of a future event, $(0, \infty)$ is a very natural space, but $\int_0^\infty 1\,d\mu(x)$ is no less infinite. Discretizing into year-sized buckets doesn’t help, since counting measure on $\mathbb{N}$ is also infinite (formally, $\sum_{i=0}^{\omega} 1 = \infty$). In the context of maximum entropy, using an infinite measure for $\mu$ means that there is no maximum-entropy $p$—you can always get more entropy by spreading the probability mass even thinner.
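A quick numerical illustration of that last point (the bucket counts here are arbitrary): relative to counting measure, the uniform distribution on $n$ points has entropy $\log n$, so spreading over more buckets always buys more entropy, with no supremum attained:

```python
import math

def entropy_uniform(n):
    """Entropy of the uniform distribution on n points, relative to
    counting measure. Analytically this equals log(n)."""
    p = 1.0 / n
    return -sum(p * math.log(p) for _ in range(n))

# Entropy grows without bound as mass is spread thinner and thinner,
# so there is no maximum-entropy distribution over all of N.
values = [entropy_uniform(n) for n in (10, 100, 1000, 10000)]
assert all(a < b for a, b in zip(values, values[1:]))  # strictly increasing
assert abs(entropy_uniform(1000) - math.log(1000)) < 1e-9
```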

But what if we discretize and also impose a nothing-up-my-sleeve outrageous-but-finite upper bound, like the maximum `binary64` number at around $1.8 \times 10^{308}$? Counting measure on $\{i : \mathbb{N} \mid i < 1.8 \times 10^{308}\}$ can be normalized into a probability measure, so what stops that from being a reasonable “unconfident” prior? Sometimes this trick can work, but the deeper issue is that the original symmetry-invariance argument that successfully justifies counting measure for shuffled cards just makes no sense here. If one relabels all the years, say reversing their order, the situation of uncertainty is decidedly not equivalent.

Another difficulty with using invariant measures as priors as a general rule is that they are not always uniquely characterized, as in the Bertrand paradox, or the obvious incompatibility between uniform priors (invariant to addition) and log-uniform priors (invariant to multiplication).

I think Humbali’s confusion can be partially explained as conflating an invariant measure and a prior—in both directions:

First, Humbali implicitly uses a translation-invariant base measure as a prior when he claims as absolute a notion of “entropy” which is actually relative to that particular base measure. Something like this mistake was made by both Laplace and Shannon, so Humbali is in good company here—but already confused, because translation on the time axis is not a symmetry with respect to which forecasts ought to be invariant.

Then, when cornered about a particular absurd prediction that inevitably arises from the first mistake, Humbali implicitly uses his (socially-driven) prior as a base measure, when he says “somebody with a wide probability distribution over AGI arrival spread over the next century, with a median in 30 years, is in realistic terms about as uncertain as anybody could possibly be.” Assuming he’s still using “entropy” at all as the barometer of virtuous unconfidence, he’s now saying that the way to fix the absurd conclusions of maximum entropy relative to Lebesgue measure is that one really ought to measure unconfidence with respect to a socially-adjusted “base rate” measure, which just happens to be his own prior. (I think the lexical overlap between “base rate” and “base measure” is not a coincidence.) This second position is more in bad faith than the first because it still has the bluster of objectivity without any grounding at all, but it has more hope of formal coherence: one can imagine a system of collectively navigating uncertainty where publicly maintaining one’s own epistemic negentropy, explicitly relative to some kind of social median, comes at a cost (e.g. hypothetically or literally wagering with others).

There is a bit of motte-and-bailey uncovered by the bad faith in position 2. Humbali all along primarily wants to defend his prior as unquestionably reasonable (the bailey), and when he brings up “maximum entropy” in the first place, he’s retreating to the motte of Lebesgue measure, which seems to have a formidable air of mathematical objectivity about it. Indeed, by its lights, Humbali’s own prior does happen to have more entropy than Eliezer’s, though Lebesgue measure fails to support the full bailey of Humbali’s actual prior. However, in this case even the motte is not defensible, since Lebesgue measure is an improper prior and the translation-invariance that might justify it simply has no relevance in this context.

Meta: any feedback about how best to make use of the channels here (commenting, shortform, posting, perhaps others I’m not aware of) is very welcome; I’m new to actually contributing content on AF.

Ha, I was just about to write this post.
To add something, I think you can justify the uniform measure on bounded intervals of reals (for illustration purposes, say $[0,1]$) by the following argument: “measuring a real number $x \in [0,1]$” is simply impossible if interpreted literally, since it would convey an infinite amount of data. Instead, this is supposed to be some sort of idealization of a situation where you can observe “as many bits as you want” of the binary expansion of the number (choosing another base gives the same measure). If you now apply the principle of indifference to each measured bit, you’re left with Lebesgue measure.
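A sketch of that argument in code (the prefix length $k$ and the test interval are arbitrary choices): indifference over $k$-bit prefixes assigns each dyadic interval $[i/2^k, (i+1)/2^k)$ mass $2^{-k}$, which agrees with Lebesgue measure on intervals up to discretization error:

```python
# Idealized "measurement" of x in [0,1] as k observed binary digits.
# Indifference over each bit makes every k-bit prefix equally likely,
# i.e. uniform mass 2^-k on each dyadic interval [i/2^k, (i+1)/2^k).
k = 10
points = [i / 2**k for i in range(2**k)]  # one representative per interval

# Under this measure, P([a, b)) is approximately b - a, as Lebesgue says,
# with error at most one dyadic interval's width.
a, b = 0.3, 0.8
mass = sum(1 for x in points if a <= x < b) / 2**k
assert abs(mass - (b - a)) < 2**-k + 1e-12
```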

It’s not clear that there’s a “right” way to apply this type of thinking to produce “the correct” prior on $\mathbb{N}$ (or $\mathbb{R}$, or any other non-compact space).

Given any particular admissible representation of a topological space, I do agree you can generate a Borel probability measure by pushing forward the Haar measure of the digit-string space $\Sigma^{\mathbb{N}}$ (considered as a countable product of $\omega$ copies of $\Sigma$, considered as a group with the modular-arithmetic structure of $\mathbb{Z}/|\Sigma|$) along the representation. This construction is studied in detail in (Mislove, 2015).

But, actually, the representation itself (in this case, the Cantor map) smuggles in Lebesgue measure, because each digit happens to cut the target space “in half” according to Lebesgue measure. If I postcompose, say, $x \mapsto \sqrt{x}$ after the Cantor map, that is also an admissible representation of $[0,1]$, but it no longer induces Lebesgue measure. This works for any continuous bijection, so any absolutely continuous probability measure on $[0,1]$ can be induced by such a representation. In fact, this is why the inverse-CDF algorithm for drawing samples from arbitrary distributions, given only uniform random bits, works.
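A small demonstration of that last claim (the bit depth is an arbitrary choice): pushing approximately-uniform dyadic values through $x \mapsto \sqrt{x}$ yields the distribution with CDF $F(x) = x^2$ (density $2x$), since $\sqrt{\cdot}$ is the inverse CDF of that distribution:

```python
import math

# Deterministic stand-in for uniform random bits: midpoints of the 2^k
# dyadic intervals of [0,1], which approximate Lebesgue measure.
k = 12
uniform_dyadics = [(i + 0.5) / 2**k for i in range(2**k)]

# Push forward through x -> sqrt(x): this is the inverse-CDF trick for
# the distribution with CDF F(x) = x^2, density 2x.
samples = [math.sqrt(u) for u in uniform_dyadics]

# Empirical CDF of the pushed-forward samples should match F(x) = x^2.
for x in (0.25, 0.5, 0.9):
    ecdf = sum(1 for s in samples if s <= x) / len(samples)
    assert abs(ecdf - x**2) < 1e-3

# And the mean should be E[sqrt(U)] = 2/3, not the uniform mean 1/2.
mean = sum(samples) / len(samples)
assert abs(mean - 2 / 3) < 1e-3
```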

That being said, you can apply this to non-compact spaces. I could get a probability measure on $\mathbb{R}$ via a decimal representation where, say, the number of leading zeros encodes the exponent in unary and the rest is the mantissa. [Edit: I didn’t think this through all the way, and it can only represent real numbers $\geq 1$. No big obstacle; post-compose $x \mapsto \log(x-1)$.] The reason there doesn’t seem to be a “correct” way to do so is that, because there’s no Haar probability measure on non-compact spaces (at least, usually?), there’s no digit representation that happens to cut up the space “evenly” according to such a canonical probability measure.

Right, that’s essentially what I mean. You’re of course right that this doesn’t let you get around the existence of non-measure-preserving automorphisms. I guess what I’m saying is that, if you’re trying to find a prior on $[0,1]$, you should think about what system of finite measurements this is idealizing, and see if you can apply a symmetry argument to those bits. Which isn’t always possible! You can only apply the principle of indifference if you’re actually indifferent. If $X \in [0,1]$ is generated in a way where “there’s no reason to suspect that any bit should be 0 rather than 1, or that there should be correlations between the bits,” then it’s of course not the case that $\sqrt{X}$ has this same property. But you still need to look at the actual situation to see what symmetry exists or doesn’t exist.

Haar measures are a high-powered way of doing this; I was just thinking about taking the iterated product measure of the uniform probability measure on $\{0,1\}$ (justified by symmetry considerations). You can of course find maps from $\{0,1\}^\infty$ to all sorts of spaces, but it seems harder to transport symmetry considerations along these maps.

Out of curiosity, this morning I did a literature search about “hard-coded optimization” in the gradient-based learning space—that is, people deliberately setting up “inner” optimizers in their neural networks because it seems like a good way to solve tasks. (To clarify, I don’t mean deliberately trying to make a general-purpose architecture learn an optimization algorithm, but rather, baking an optimization algorithm into an architecture and letting the architecture learn what to do with it.)

Why is this interesting?

The most compelling arguments in Risks from Learned Optimization that mesa-optimizers will appear involve competitiveness: incorporating online optimization into a policy can help with generalization, compression, etc.

If inference-time optimization really does help competitiveness, we should expect to see some of the relevant competitors trying to do it on purpose.

I recall some folks saying in 2019 that the apparent lack of this seemed like evidence against the arguments that mesa-optimizers will be competitive.

To the extent there is now a trend toward explicit usage of inference-time optimizers, that supports the arguments that mesa-optimizers would be competitive, and thus may emerge accidentally as general-purpose architectures scale up.

More importantly (and also mentioned in Risks from Learned Optimization, as “hard-coded optimization”), if the above arguments hold, then it would help safety to bake in inference-time optimization on purpose, since we can better control and understand optimization when it’s engineered—assuming that engineering it doesn’t sacrifice task performance (so that the incentive for the base optimizer to evolve a de novo mesa-optimizer is removed).

So, engineered inference-time optimization is plausibly one of those few capabilities research directions that is weakly differential tech development, in the sense that it accelerates safe AI more than it accelerates unsafe AI (although it accelerates both). I’m not confident enough about this to say that it’s a good direction to work on, but I am saying it seems like a good direction to be aware of and occasionally glance at.

My impression is that a majority of AI alignment/safety agendas/proposals the last few years have carried a standard caveat that they don’t address the inner alignment problem at all, or at least deceptive alignment in particular.

As far as I can tell, there are few public directions about how to address deceptive alignment:

The ELK report has an intriguing paragraph suggesting that solutions to ELK might transfer to deceptive alignment via an analogy about “honesty”, but my sense is that solving ELK for learned optimizers (which has its own, much longer section of the report) will likely require additional new ideas beyond a solution to the “base case” of ELK (although I do agree with the report that the latter is currently a more exciting research direction than thinking explicitly about learned optimizers).

Although hard-coding optimization certainly doesn’t rule out learned optimization, I’m optimistic that it may be an important component of a suite of safety mechanisms (combined with “transparency via ELK” and perhaps one or two other major ideas, which may not be discovered yet) that finally rule out deceptive alignment.

The introduction here is a great explanation of why ML engineers want to incorporate an optimizer into a learning system, rather than learning a predictive model and then applying an engineered optimizer.

But their main theoretical claims rest on a concept of “predictive model” that only outputs point estimates, rather than distributions (and so cannot represent correlated variables).

This was my starting point—a very neat piece of work from the Boyd group (probably the top convex-optimization lab in US academia), where the cone program solver in CVXPY is adapted to not only calculate solutions for the forward pass, but also pull gradient covectors in the solution space back to gradient covectors on the problem specification for the backward pass.
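This is not the cone-program machinery itself, but the underlying idea can be sketched on a made-up one-dimensional problem: differentiate *through* an argmin by implicitly differentiating its stationarity condition, then check the result against a finite difference:

```python
# Forward pass:  x*(theta) = argmin_x  0.5*(x - theta)^2 + 0.5*lam*x^2
# Stationarity:  (x - theta) + lam*x = 0   =>   x* = theta / (1 + lam)
# Backward pass: differentiating the stationarity condition in theta
#                gives dx*/dtheta = 1 / (1 + lam), with no need to
#                differentiate the iterative solver itself.

lam = 0.3  # arbitrary regularization weight for illustration

def solve(theta):
    return theta / (1.0 + lam)   # forward: closed-form argmin

def grad(theta):
    return 1.0 / (1.0 + lam)     # backward: implicit derivative

# Sanity-check the implicit gradient against a central finite difference.
theta, eps = 2.0, 1e-6
fd = (solve(theta + eps) - solve(theta - eps)) / (2 * eps)
assert abs(fd - grad(theta)) < 1e-6
```

The real contribution of the paper is doing this pullback for general cone programs, where the "stationarity condition" is the KKT/residual system of the solver; the one-liner above is only the simplest instance of that pattern.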

A way to approximate dynamic programming as convex programming—for combinatorial optimization and belief propagation, though, rather than reinforcement learning.

Adapts an old-school symbolic-reasoning-style method based on integer linear programming to be a convex differentiable layer that sits on top of a 110M-parameter BERT.

Training the two together:

of course makes a huge improvement over the old-school method by itself,

a noticeable improvement over the BERT by itself, and

even a marginal improvement over a much larger (340M-parameter) BERT by itself.

Here, a differentiable convex optimization layer is used as a subproblem of a physics predictor, specifically for contact dynamics—based on the physical principle that contact forces maximize the rate of energy dissipation. The physics predictor is then wrapped in a gradient-based optimization to compute optimal actions starting from a given initial state.

This is a neat motivating example for a mesa-mesa-optimizer: layer 1 wants a good policy, layer 2 wants a good action, and layer 3 wants a good prediction (where physics itself can sometimes best be predicted using an optimizer).

This is a follow-up from the Boyd group where they apply their idea to a bunch of toy problems. Most compelling is section 5.4 on vehicle control; the financial portfolio and supply-chain problems are also interesting.

This isn’t actually building a hard-coded optimizer into a gradient-based learning architecture—instead, it’s better described as building a bit of gradient-based learning into a hard-coded optimizer. Tangential but still interesting.

Something like ELK makes an intriguing appearance here: an optimal policy, parameterized by unknowns about the actual dynamics of the building/HVAC system, is trained with imitation learning from expert demonstrations, and thereby learns the parameters of the system dynamics.

Convex optimization layers are applied to solve for efficient power generation/transmission choices, while the neural-network layers are used to model the unpredictable output of renewable sources.

Here are some other potentially relevant papers I haven’t processed yet:

Ma, Hengbo, Bike Zhang, Masayoshi Tomizuka, and Koushil Sreenath. “Learning Differentiable Safety-Critical Control Using Control Barrier Functions for Generalization to Novel Environments.” ArXiv:2201.01347 [Cs, Eess], January 7, 2022. http://arxiv.org/abs/2201.01347.

Rojas, Junior, Eftychios Sifakis, and Ladislav Kavan. “Differentiable Implicit Soft-Body Physics.” ArXiv:2102.05791 [Cs], September 9, 2021. http://arxiv.org/abs/2102.05791.

Srinivas, Aravind, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. “Universal Planning Networks: Learning Generalizable Representations for Visuomotor Control.” In Proceedings of the 35th International Conference on Machine Learning, 4732–41. PMLR, 2018. https://proceedings.mlr.press/v80/srinivas18b.html.

For various collective-epistemics and cooperative-decision-making endeavours, I think a key technical enabler might be DVCS for structured data. To that end, I am interested in funding work in this direction. Aside from being in a position to allocate some funding, I think I have some comparative advantage in a broad inside-view awareness of potentially relevant theoretical footholds, and this post is intended to start unfurling that inside view, initially as a list of links. People who I fund to work on this should read the abstracts of all of these papers, pay special attention to those marked with (!), skim/read further as they see fit, and cite them in their writeups (at least under “related work”). I’m posting this here as part of a general policy of using this platform for any of my even-vaguely-technical output that goes beyond tweets.

This paper describes the theory and implementation of an operation (double-pushout-rewriting a.k.a. DPO-rewriting) on a data model (C-sets) which is what I’m currently most excited about exploring as a notion of “patch” (or “commit”) in DVCS of structured data.

(!) One specific type of structured data I’m particularly interested in applying the proposed work to is hypernets, a combinatorial representation of string diagrams, defined alongside a double-pushout-rewriting scheme for them in section 4 of Functorial String Diagrams for Reverse-mode Automatic Differentiation. Ideally, this should be a special case of the above.

Another one I’m interested in is Polynomial Circuits, which can be represented as hypernets.

This paper introduces a consistency criterion that is weak enough to be satisfied by most CRDTs, but stronger than mere eventual consistency. I would expect any DVCS to probably satisfy this criterion.

This short paper outlines some simple, git-like mechanisms for rejecting invalid messages that I would want to see in basically any decentralized system.

I would like the system to support incremental updating of the outputs of functions defined on the structured data (a.k.a. “views” in the database sense), and I think this paper describes a good building block for that. I have flagged it for special attention because I want compatibility with the Datalog model to be in the back of our minds as we make design decisions.
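As a hedged sketch (not the paper's actual algorithm) of why Datalog fits incremental updating: semi-naive evaluation only ever joins *newly derived* facts (the delta) against the base relations, which is the same basic move an incremental view-maintenance engine makes when the underlying data changes. Transitive closure is the classic example:

```python
# Semi-naive evaluation of:  path(x, z) :- edge(x, z).
#                            path(x, z) :- path(x, y), edge(y, z).
# Each round joins only the delta (facts derived last round) against
# edge, instead of re-deriving everything from scratch.
def transitive_closure(edge_list):
    edges = set(edge_list)
    closure = set(edges)
    delta = set(edges)
    while delta:
        new = {(x, z) for (x, y) in delta
                      for (y2, z) in edges if y == y2}
        delta = new - closure   # keep only genuinely new facts
        closure |= delta
    return closure

tc = transitive_closure([(1, 2), (2, 3), (3, 4)])
assert (1, 4) in tc and (2, 4) in tc
assert (4, 1) not in tc  # closure respects edge direction
```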

This is an alternative data-model where database states are functors from C→Rel instead of functors from C→Set. This is in some ways closer to the relational database model, so there might be some interest in generalizing the C-Set DPO algorithm to “C-Rel”s. However, morphisms in Rel can be represented as spans in Set, and that’s probably easier.

“Concern, Respect, and Cooperation” is a contemporary moral-philosophy book by Garrett Cullity which advocates for a pluralistic foundation of morality, based on three distinct principles:

Concern: Moral patients’ welfare calls for promotion, protection, sensitivity, etc.

Respect: Moral patients’ self-expression calls for non-interference, listening, address, etc.

Cooperation: Worthwhile collective action calls for initiation, joining in, collective deliberation, sharing responsibility, etc.
And one bonus principle, whose necessity he’s unsure of:

Protection: Precious objects call for protection, appreciation, and communication of the appreciation.

What I recently noticed here and want to write down is a loose correspondence between these different foundations for morality and some approaches to safe superintelligence:

CEV-maximization corresponds to finding a good enough definition of human welfare that Concern alone suffices for safety.

Corrigibility corresponds to operationalizing some notion of Respect that would alone suffice for safety.

Multi-agent approaches lean in the direction of Cooperation.

Approaches that aim to just solve literally the “superintelligence that doesn’t destroy us” problem, without regard for the cosmic endowment, sometimes look like Protection.

Cullity argues that none of his principles is individually a satisfying foundation for morality, but that all four together (elaborated in certain ways with many caveats) seem adequate (and maybe just the first three). I have a similar intuition about AI safety approaches. I can’t yet make the analogy precise, but I feel worried when I imagine corrigibility alone, CEV alone, bargaining alone (whether causal or acausal), or Earth-as-wildlife-preserve; whereas I feel pretty good imagining a superintelligence that somehow balances all four. I can imagine that one of them might suffice as a foundation for the others, but I think this would be path-dependent at best. I would be excited about work that tries to do for Cullity’s entire framework what CEV does for pure single-agent utilitarianism (namely, make it more coherent and robust and closer to something that could be formally specified).

Useful primitives for incentivizing alignment-relevant metrics without compromising on task performance might include methods like Orthogonal Gradient Descent or Averaged Gradient Episodic Memory, evaluated and published in the setting of continual learning or multi-task learning. Something like “answer questions honestly” could mathematically be thought of as an additional task to learn, rather than as an inductive bias or regularization to incorporate. And I think these two training modifications are quite natural (I just came to essentially the same ideas independently and then thought “if either of these would work then surely the multi-task learning folks would be doing them?” and then I checked and indeed they are). Just some more nifty widgets to add to my/our toolbox.
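A toy sketch of the projection at the heart of Orthogonal Gradient Descent (the vectors here are made up; real OGD stores and orthogonalizes gradients of the network's outputs on earlier tasks): project the new objective's gradient onto the subspace orthogonal to the stored gradients, so the update doesn't (to first order) disturb what was already learned:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_orthogonal(g, stored):
    """Remove from g its components along each stored gradient.
    (Assumes stored gradients are mutually orthogonal, as OGD arranges
    via Gram-Schmidt when it stores them.)"""
    g = list(g)
    for s in stored:
        coeff = dot(g, s) / dot(s, s)
        g = [gi - coeff * si for gi, si in zip(g, s)]
    return g

stored = [[1.0, 0.0, 0.0]]   # gradient retained from an earlier "task"
g_new = [2.0, 3.0, 1.0]      # gradient of the new objective
g_proj = project_orthogonal(g_new, stored)

assert abs(dot(g_proj, stored[0])) < 1e-12  # orthogonal to the old gradient
assert g_proj[1:] == g_new[1:]              # other components untouched
```

Framing "answer questions honestly" as one more task in such a setup is exactly the multi-task view described above: the alignment objective supplies one of the gradients, and the projection keeps the objectives from trading off against each other at first order.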

Re: alignment tasks in multi-task settings. I think this makes a lot of sense. Especially in worlds where we have a lot of ML/AI systems doing a bunch of different things, even if they have very different specific tasks, the “library” of alignment objectives is probably widely shared.

This is one question where my odds are pretty far from being coherently Bayesian; I would bet against only up to around 100:1, but I wouldn’t bet in favor until somewhere around 8e7:1.

Why is that not Bayesian? The decision to bet is going to include a term for your counterparty’s willingness to bet, which is some evidence.

One way to overcome this in thought experiments is to frame it as a line with no spread, and no neutral option. At, say, 150k:1, which side would you bet? Then adjust to where you’d SET a line where you had to accept a wager on either side.
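For concreteness (interpreting the two odds thresholds as fair-bet boundaries, which is my reading rather than anything stated above): asymmetric betting thresholds pin down an interval of implied probabilities rather than the single number a coherent Bayesian's line would give:

```python
def implied_prob(odds_against):
    """Probability at which a bet at odds_against:1 against the event
    is exactly fair."""
    return 1.0 / (odds_against + 1.0)

# "Bet against only up to ~100:1" and "bet in favor only from ~8e7:1"
# bracket the implied probability from above and below respectively.
lower = implied_prob(8e7)   # ~1.25e-8: floor implied by the in-favor side
upper = implied_prob(100)   # ~9.9e-3:  ceiling implied by the against side

assert lower < upper                 # a wide interval, not a point
assert upper / lower > 1e5           # spanning several orders of magnitude
```

Setting a single line with no spread, as suggested, is precisely the operation that would collapse this interval to a point.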

I want to go a bit deep here on “maximum entropy” and misunderstandings thereof by the straw-man Humbali character, mostly to clarify things for myself, but also in the hopes that others might find it useful. I make no claim to novelty here—I think all this ground was covered by Jaynes (1968)—but I do have a sense that this perspective (and the measure-theoretic intuition behind it) is not pervasive around here, the way Bayesian updating is.

First, I want to point out that entropy of a probability measure p is only definable relative to a base measure μ, as follows:

Hμ(p)=−∫Xdpdμ(x)logdpdμ(x)dμ(x)(The derivatives notated here denote Radon-Nikodym derivatives; the integral is Lebesgue.) Shannon’s formulae, the discrete H(p)=−∑ip(xi)logp(xi) and the continuous H(p)=−∫Xp(x)logp(x)dx, are the special cases of this where μ is assumed to be counting measure or Lebesgue measure, respectively. These formulae actually treat p as having a subtly different

typethan “probability measure”: namely, they treat it as a density with respect to counting measure (a “probability mass function”) or a density with respect to Lebesgue measure (a “probability density function”), and implicitly supply the corresponding μ.If you’re familiar with Kullback–Leibler divergence (DKL), and especially if you’ve heard DKL called “relative entropy,” you may have already surmised that Hμ(p)=−DKL(p||μ). Usually, KL divergence is defined with both arguments being

probabilitymeasures (measures that add up to 1), but that’s not required for it to be well-defined (what is required is absolute continuity, which is sort of orthogonal). The principle of “maximum entropy,” or argmaxpHμ(p), is equivalent to argminpDKL(p||μ). In the absence of additional constraints on p, the solution of this is p=μ, so maximum entropy makes sense as a rule for minimum confidenceto exactly the same extent that the implicit base measureμmakes sense as a prior. The principle of maximum entropy should really be called “the principle of minimum updating”, i.e., making a minimum-KL-divergence move from your prior μ to your posterior p when the posterior is constrained to exactly agree with observed facts. (Standard Bayesian updating can be derived as a special case of this.)Sometimes, the structure of a situation has some symmetry group with respect to which the situation of uncertainty seems to be invariant, with classic examples being relabeling heads/tails on a coin, or arbitrarily permuting a shuffled deck of cards. In these examples, the requirement that a prior be invariant with respect to those symmetries (in Jaynes’ terms, the principle of transformation groups) uniquely characterizes counting measure as the only consistent prior (the classical principle of indifference, which still lies at the core of grade-school probability theory). In other cases, like a continuous roulette wheel, other Haar measures (which generalize both counting and Lebesgue measure) are justified. But taking “indifference” or “insufficient reason” to justify using an invariant measure as a prior

in an arbitrary situation(as Laplace apparently did) is fraught with difficulties:Most obviously, the invariant measure on [−∞,∞] with respect to translations, namely Lebesgue measure, is an

improper prior: it is a non-probability measure because its integral (formally, ∫∞−∞1dμ(x)) is infinite. If we’re talking about forecasting the timing of a future event, (0,∞) is a very natural space, but ∫∞01dμ(x) is no less infinite. Discretizing into year-sized buckets doesn’t help, since counting measure on N is also infinite (formally, ∑ωi=01=∞). In the context of maximum entropy, using an infinite measure for μ means that thereisno maximum-entropy p—you can always get more entropy by spreading the probability masseven thinner.But what if we discretize

andalso impose a nothing-up-my-sleeve outrageous-but-finite upper bound, like the maximum`binary64`

number at around 1.8×10308? Counting measure on {i:N|i<1.8×10308} can be normalized into a probability measure, so what stopsthatfrom being a reasonable “unconfident” prior? Sometimes this trick can work, but the deeper issue is thatthe original symmetry-invariance argument that successfully justifies counting measure for shuffled cards just makes no sense here. If one relabels all the years, say reversing their order, the situation of uncertainty is decidedlynotequivalent.Another difficulty with using invariant measures as priors as a general rule is that they are not always uniquely characterized, as in the Bertrand paradox, or the obvious incompatibility between uniform priors (invariant to addition) and log-uniform priors (invariant to multiplication).

I think Humbali’s confusion can be partially explained as conflating an invariant measure and a prior—in

bothdirections:First, Humbali

implicitly uses a translation-invariant base measure as a priorwhen he claims as absolute a notion of “entropy” which is actually relative to that particular base measure. Something like this mistake was made by both Laplace and Shannon, so Humbali is in good company here—but already confused, because translation on the time axis is not a symmetry with respect to which forecasts ought to be invariant.Then, when cornered about a particular absurd prediction that inevitably arises from the first mistake, Humbali

implicitly uses his (socially-driven) prior as a base measure, when he says “somebody with a wide probability distribution over AGI arrival spread over the next century, with a median in 30 years, is in realistic terms about as uncertain as anybody could possibly be.” Assuming he’s still using “entropy” at all as the barometer of virtuous unconfidence, he’s now saying that the way to fix the absurd conclusions of maximum-entropy relative to Lebesgue measure is that onereallyought to measure unconfidence with respect to a socially-adjusted “base rate” measure, which just happens to be his own prior. (I think the lexical overlap between “base rate” and “base measure” is not a coincidence.) This second position is more in bad-faith than the first because it still has the bluster of objectivity without any grounding at all, but it has more hope of formal coherence: one can imagine a system of collectively navigating uncertainty where publicly maintaining one’s own epistemic negentropy,explicitlyrelative to some kind of social median, comes at a cost (e.g. hypothetically or literally wagering with others).There is a bit of motte-and-bailey uncovered by the bad-faith in position 2. Humbali all along primarily wants to defend his prior as unquestionably reasonable (the bailey), and when he brings up “maximum entropy” in the first place, he’s retreating to the motte of Lebesgue measure, which seems to have a formidable air of mathematical objectivity about it. Indeed, by its lights, Humbali’s own prior does happen to have more entropy than Eliezer’s, though Lebesgue measure fails to support the full bailey of Humbali’s actual prior. However, in this case even the motte is not defensible, since Lebesgue measure is an improper prior and the translation-invariance that might justify it simply has no relevance in this context.

Meta: any feedback about how best to make use of the channels here (commenting, shortform, posting, perhaps others I’m not aware of) is very welcome; I’m new to actually contributing content on AF.

Ha, I was just about to write this post. To add something, I think you can justify the uniform measure on bounded intervals of reals (for illustration purposes, say [0,1]) by the following argument: “Measuring a real number x∈[0,1]” is obviously simply impossible if interpreted literally, containing an infinite amount of data. Instead this is supposed to be some sort of idealization of a situation where you can observe “as many bits as you want” of the binary expansion of the number (choosing another base gives the same measure). If you now apply the principle of indifference to each measured bit, you’re left with Lebesgue measure.

It’s not clear that there’s a “right” way to apply this type of thinking to produce “the correct” prior on N (or R, or any other non-compact space).

Given any particular admissible representation of a topological space, I do agree you can generate a Borel probability measure by pushing forward the Haar measure of the digit-string space ΣN (considered as a countable product of ω copies of Σ, considered as a group with the modular-arithmetic structure of Z/|Σ|) along the representation. This construction is studied in detail in (Mislove, 2015).

But, actually, the representation itself (in this case, the Cantor map) smuggles in Lebesgue measure, because each digit happens to cut the target space “in half” according to Lebesgue measure. If I postcompose, say, x↦√x after the Cantor map, that is also an admissible representation of [0,1], but it no longer induces Lebesgue measure. This works for any continuous bijection, so *any* absolutely continuous probability measure on [0,1] can be induced by such a representation. In fact, this is why the inverse-CDF algorithm for drawing samples from arbitrary distributions, given only uniform random bits, works.

That being said, you *can* apply this to non-compact spaces. I could get a probability measure on R via a decimal representation, where, say, the number of leading zeros encodes the exponent in unary and the rest is the mantissa. [Edit: I didn’t think this through all the way, and it can only represent real numbers ≥1. No big obstacle; post-compose x↦log(x−1).] The reason there doesn’t seem to be a “correct” way to do so is that, because there’s no Haar probability measure on non-compact spaces (at least, usually?), there’s no digit representation that happens to cut up the space “evenly” according to such a canonical probability measure.

Right, that’s essentially what I mean. You’re of course right that this doesn’t let you get around the existence of non-measure-preserving automorphisms. I guess what I’m saying is that, if you’re trying to find a prior on [0,1], you should try to think about what system of finite measurements this is idealizing, and see if you can apply a symmetry argument to those bits. Which isn’t always the case! You can only apply the principle of indifference if you’re actually indifferent. But *if* it’s the case that X∈[0,1] is generated in a way where “there’s no reason to suspect that any bit should be 0 rather than 1, or that there should be correlations between the bits”, then it’s of course not the case that √X has this same property. But of course you still need to look at the actual situation to see what symmetry exists or doesn’t exist.

Haar measures are a high-powered way of doing this; I was just thinking about taking the iterated product measure of the uniform probability measure on {0,1} (justified by symmetry considerations). You can of course find maps from {0,1}^∞ to all sorts of spaces, but it seems harder to transport symmetry considerations along these maps.
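On the inverse-CDF algorithm mentioned above: given uniform random bits, postcomposing the representation with F⁻¹ yields samples with CDF F. A minimal sketch (the Exponential(1) target is chosen arbitrarily for illustration):

```python
import math
import random

random.seed(0)

# Inverse-CDF sampling: if U is uniform on [0,1] (idealized "uniform random
# bits"), then F^-1(U) is distributed with CDF F. Target: Exponential(1),
# where F(x) = 1 - exp(-x), so F^-1(u) = -ln(1 - u).
def sample_exponential():
    u = random.random()
    return -math.log(1.0 - u)

samples = [sample_exponential() for _ in range(100_000)]
mean = sum(samples) / len(samples)  # Exponential(1) has mean 1
```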

Out of curiosity, this morning I did a literature search about “hard-coded optimization” in the gradient-based learning space—that is, people deliberately setting up “inner” optimizers in their neural networks because it seems like a good way to solve tasks. (To clarify, I don’t mean deliberately trying to make a general-purpose architecture learn an optimization algorithm, but rather, baking an optimization algorithm into an architecture and letting the architecture learn what to do with it.)

Why is this interesting?

The most compelling arguments in Risks from Learned Optimization that mesa-optimizers will appear involve competitiveness: incorporating online optimization into a policy can help with generalization, compression, etc.

If inference-time optimization really *does* help competitiveness, we should expect to see some of the relevant competitors trying to do it on purpose. I recall some folks saying in 2019 that the apparent lack of this seemed like evidence against the arguments that mesa-optimizers will be competitive.

To the extent there is now a trend toward explicit usage of inference-time optimizers, that supports the arguments that mesa-optimizers would be competitive, and thus may emerge accidentally as general-purpose architectures scale up.

More importantly (and also mentioned in Risks from Learned Optimization, as “hard-coded optimization”), if the above arguments hold, then it would *help* safety to bake in inference-time optimization on purpose, since we can better control and understand optimization when it’s engineered—assuming that engineering it doesn’t sacrifice task performance (so that the incentive for the base optimizer to evolve a *de novo* mesa-optimizer is removed).

So, engineered inference-time optimization is plausibly one of those few capabilities research directions that is *weakly* differential tech development, in the sense that it accelerates safe AI *more* than it accelerates unsafe AI (although it accelerates both). I’m not confident enough about this to say that it’s a good direction to work on, but I am saying it seems like a good direction to be aware of and occasionally glance at.

My impression is that a majority of AI alignment/safety agendas/proposals in the last few years have carried a standard caveat that they don’t address the inner alignment problem at all, or at least deceptive alignment in particular.

As far as I can tell, there are only a few public directions for addressing deceptive alignment:

Myopia (probably trading off competitiveness)

Penalizing complexity (description, computational, etc.; seems to be rough consensus this won’t work)

Transparency via inspection

Adversarial training (aka transparency via training)

Hard-coded optimization (aka transparency via architecture)

The ELK report has an intriguing paragraph suggesting that solutions to ELK might transfer to deceptive alignment via an analogy about “honesty”, but my sense is that solving ELK for learned optimizers (which has its own, much longer section of the report) will likely require additional new ideas beyond a solution to the “base case” of ELK (although I do agree with the report that the latter is currently a more exciting research direction than thinking explicitly about learned optimizers).

Although hard-coding optimization certainly doesn’t rule out learned optimization, I’m optimistic that it may be an important component of a suite of safety mechanisms (combined with “transparency via ELK” and perhaps one or two other major ideas, which may not be discovered yet) that finally rule out deceptive alignment.

Anyway, here’s (some of) what I found:

Task-independent work

Cameron, Chris, Jason Hartford, Taylor Lundy, and Kevin Leyton-Brown. “The Perils of Learning Before Optimizing.” AAAI 2022, December 16, 2021.

The introduction here is a great explanation of why ML engineers *want* to incorporate an optimizer into a learning system, rather than learning a predictive model and then applying an engineered optimizer. But their main theoretical claims rest on a concept of “predictive model” that only outputs point estimates, rather than distributions (and so cannot represent correlated variables).

Agrawal, Akshay, Brandon Amos, Shane Barratt, Stephen Boyd, et al. “Differentiable Convex Optimization Layers.” NeurIPS 2019, October 28, 2019.

This was my starting point—a very neat piece of work from the Boyd group (probably the top convex-optimization lab in US academia), where the cone-program solver in CVXPY is adapted not only to calculate solutions for the forward pass, but also to pull gradient covectors in the solution space back to gradient covectors on the problem specification for the backward pass.
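To illustrate the general trick such layers rely on (this is my own scalar toy sketch, not the paper’s cone-program machinery): the inner problem is solved iteratively on the forward pass, while the backward pass uses the implicit function theorem at the optimality condition, so gradients never flow through the solver’s iterations.

```python
# Inner problem: x*(b) = argmin_x 0.5*q*x**2 - b*x; optimality: q*x* - b = 0.

def inner_solve(q, b, lr=0.1, steps=200):
    """Forward pass: plain gradient descent on the inner objective."""
    x = 0.0
    for _ in range(steps):
        x -= lr * (q * x - b)
    return x

def dxstar_db(q):
    """Backward pass: implicit differentiation of q*x* - b = 0 gives 1/q."""
    return 1.0 / q

q, b = 2.0, 3.0
x_star = inner_solve(q, b)                          # converges to b/q = 1.5
# Outer loss (x* - 1)^2; chain rule through the optimization layer:
outer_grad_b = 2.0 * (x_star - 1.0) * dxstar_db(q)  # = 2 * 0.5 * 0.5 = 0.5
```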

Butler, Andrew, and Roy Kwon. “Efficient Differentiable Quadratic Programming Layers: An ADMM Approach.” arXiv:2112.07464, December 14, 2021.

A recent development of a much more efficient algorithm for the special case of differentiating through a quadratic-program solver.

Mensch, Arthur, and Mathieu Blondel. “Differentiable Dynamic Programming for Structured Prediction and Attention.” ICML 2018, February 20, 2018.

A way to approximate dynamic programming as convex programming—for combinatorial optimization and belief propagation, though, rather than reinforcement learning.

Gilton, Davis, Gregory Ongie, and Rebecca Willett. “Deep Equilibrium Architectures for Inverse Problems in Imaging.” arXiv:2102.07944, June 2, 2021.

A very nifty recipe for back-propagating through a fixed-point algorithm (adapting the same fixed-point algorithm to calculate the backward pass). Also has task-specific parts for reconstructing 2D and 3D images.

NLP

Thayaparan, Mokanarangan, Marco Valentino, Deborah Ferreira, Julia Rozanova, and André Freitas. “∂-Explainer: Abductive Natural Language Inference via Differentiable Convex Optimization.” arXiv:2105.03417, May 17, 2021.

Adapts an old-school symbolic-reasoning-style method based on integer linear programming to be a convex differentiable layer that sits on top of a 110M-parameter BERT.

Training the two together:

of course makes a huge improvement over the old-school method by itself,

a noticeable improvement over the BERT by itself, and

even a marginal improvement over a much larger (340M-parameter) BERT by itself.

Robotics

Zhong, Yaofeng Desmond, Biswadip Dey, and Amit Chakraborty. “Extending Lagrangian and Hamiltonian Neural Networks with Differentiable Contact Models.” NeurIPS 2021, November 12, 2021.

Here, a differentiable convex optimization layer is used as a subproblem of a physics predictor, specifically for contact dynamics—based on the physical principle that contact forces maximize the rate of energy dissipation. The physics predictor is then wrapped in a gradient-based optimization to compute optimal actions starting from a given initial state.

This is a neat motivating example for a mesa-mesa-optimizer: layer 1 wants a good policy, layer 2 wants a good action, and layer 3 wants a good prediction (where physics itself can sometimes best be predicted using an optimizer).

Agrawal, Akshay, Shane Barratt, Stephen Boyd, and Bartolomeo Stellato. “Learning Convex Optimization Control Policies.” arXiv:1912.09529, December 19, 2019.

This is a follow-up from the Boyd group where they apply their idea to a bunch of toy problems. Most compelling is section 5.4 on vehicle control; the financial-portfolio and supply-chain problems are also interesting.

Srikanth, Shashank, Mithun Babu, Houman Masnavi, Arun Kumar Singh, Karl Kruusamäe, and K. Madhava Krishna. “Fast Adaptation of Manipulator Trajectories to Task Perturbation by Differentiating through the Optimal Solution.” arXiv:2011.00488, November 1, 2020.

This isn’t actually building a hard-coded optimizer into a gradient-based learning architecture—instead, it’s better described as building a bit of gradient-based learning into a hard-coded optimizer. Tangential but still interesting.

Energy systems

Chen, Bingqing, Zicheng Cai, and Mario Bergés. “Gnu-RL: A Precocial Reinforcement Learning Solution for Building HVAC Control Using a Differentiable MPC Policy.” In Proceedings of the 6th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation, 316–25. BuildSys ’19. New York, NY, USA: Association for Computing Machinery, 2019.

Something like an intriguing approach to ELK appears here: an optimal policy, parameterized by unknowns about the actual dynamics of the building/HVAC system, is trained with imitation learning from expert demonstrations, and thereby learns the parameters of the system dynamics.

Chen, Yize. “Learning to Operate a Sustainable Power System.” PhD thesis, University of Washington, 2021.

Convex optimization layers are applied to solve for efficient power generation/transmission choices, while the neural-network layers are used to model the unpredictable output of renewable sources.

Here are some other potentially relevant papers I haven’t processed yet:

Ma, Hengbo, Bike Zhang, Masayoshi Tomizuka, and Koushil Sreenath. “Learning Differentiable Safety-Critical Control Using Control Barrier Functions for Generalization to Novel Environments.” arXiv:2201.01347, January 7, 2022. http://arxiv.org/abs/2201.01347.

Rojas, Junior, Eftychios Sifakis, and Ladislav Kavan. “Differentiable Implicit Soft-Body Physics.” arXiv:2102.05791, September 9, 2021. http://arxiv.org/abs/2102.05791.

Srinivas, Aravind, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. “Universal Planning Networks: Learning Generalizable Representations for Visuomotor Control.” In Proceedings of the 35th International Conference on Machine Learning, 4732–41. PMLR, 2018. https://proceedings.mlr.press/v80/srinivas18b.html.

Among computational constraints, I think the most significant/fundamental are, in order:

Semicomputability

P (polynomial time)

PSPACE

Computability

BPP

NP

Δ11 (first-order hypercomputable)

All the rest (BQP, PP, RP, EXPTIME, etc.)

For various collective-epistemics and cooperative-decision-making endeavours, I think a key technical enabler might be DVCS (distributed version control) for structured data. To that end, I am interested in funding work in this direction. Aside from being in a position to allocate some funding, I think I have some comparative advantage in a broad inside-view awareness of potentially relevant theoretical footholds, and this post is intended to start unfurling that inside view, initially as a list of links. People who I fund to work on this should read the abstracts of all of these papers, pay special attention to those marked with (!), skim/read further as they see fit, and cite them in their writeups (at least under “related work”). I’m posting this here as part of a general policy of using this platform for any of my even-vaguely-technical output that goes beyond tweets.

(!) A Categorical Theory of Patches

(!) The CALM Theorem: When Distributed Consistency is Easy

Weaker Forms of Monotonicity for Declarative Networking: A More Fine-Grained Answer to the CALM-Conjecture

Logic and Lattices for Distributed Programming

(!) Double-pushout-rewriting of C-sets

This paper describes the theory and implementation of an operation (double-pushout-rewriting a.k.a. DPO-rewriting) on a data model (C-sets) which is what I’m currently most excited about exploring as a notion of “patch” (or “commit”) in DVCS of structured data.
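To make the “patch” intuition concrete, here is a toy double-pushout rewrite on plain directed graphs (my own minimal encoding; the paper’s C-set version is far more general, and this sketch assumes the match already satisfies the gluing condition):

```python
# Toy DPO rewrite: a rule is a span (L <- K -> R); applying it at a match
# deletes the image of L\K and glues in fresh copies of R\K along K.
# Graphs are (nodes, edges) pairs, with edges as (src, dst) tuples.

def apply_rule(graph, L, K, R, match):
    """match maps L's nodes to graph nodes; K is the preserved interface."""
    nodes, edges = set(graph[0]), set(graph[1])
    # Delete the matched image of L \ K.
    for n in L[0] - K[0]:
        nodes.discard(match[n])
    for (s, d) in L[1] - K[1]:
        edges.discard((match[s], match[d]))
    # Add fresh copies of R \ K, glued along K via the match.
    emb = {n: match[n] for n in K[0]}
    for n in R[0] - K[0]:
        emb[n] = f"new_{n}"
        nodes.add(emb[n])
    for (s, d) in R[1] - K[1]:
        edges.add((emb[s], emb[d]))
    return nodes, edges

# Rule: rewrite an edge a->b into a path a->c->b (keep a and b, drop the edge).
L = ({"a", "b"}, {("a", "b")})
K = ({"a", "b"}, set())
R = ({"a", "b", "c"}, {("a", "c"), ("c", "b")})
G = ({"x", "y"}, {("x", "y")})
H = apply_rule(G, L, K, R, {"a": "x", "b": "y"})
```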

(!) One specific type of structured data I’m particularly interested in applying the proposed work to is hypernets, a combinatorial representation of string diagrams, defined alongside a double-pushout-rewriting scheme for them in section 4 of Functorial String Diagrams for Reverse-mode Automatic Differentiation. Ideally, this should be a special case of the above.

Another one I’m interested in is Polynomial Circuits, which can be represented as hypernets.

A Generalized Concurrent Rule Construction for Double-Pushout Rewriting

Describes when DPO rewrite rules are compatible in the sense that they can be merged into a single concurrent rule.

Replication-Aware Linearizability

This paper introduces a consistency criterion that is weak enough to be satisfied by most CRDTs, but stronger than mere eventual consistency. I would expect any DVCS to probably satisfy this criterion.

(!) Making CRDTs Byzantine Fault Tolerant

This short paper outlines some simple, git-like mechanisms for rejecting invalid messages that I would want to see in basically *any* decentralized system.

(!) Fixing incremental computation: Derivatives of fixpoints & the recursive semantics of Datalog

I would like the system to support incremental updating of the outputs of functions defined on the structured data (a.k.a. “views” in the database sense), and I think this paper describes a good building block for that. I have flagged it for special attention because I want compatibility with the Datalog model to be in the back of our minds as we make design decisions.

Knowledge Representation in Bicategories of Relations

This is an alternative data-model where database states are functors from C→Rel instead of functors from C→Set. This is in some ways closer to the relational database model, so there might be some interest in generalizing the C-Set DPO algorithm to “C-Rel”s. However, morphisms in Rel can be represented as spans in Set, and that’s probably easier.

Datalog: Bag Semantics via Set Semantics

Algebraic Property Graphs

Concise, Type-Safe, and Efficient Structural Diffing

“Concern, Respect, and Cooperation” is a contemporary moral-philosophy book by Garrett Cullity which advocates for a pluralistic foundation of morality, based on three distinct principles:

Concern: Moral patients’ welfare calls for promotion, protection, sensitivity, etc.

Respect: Moral patients’ self-expression calls for non-interference, listening, address, etc.

Cooperation: Worthwhile collective action calls for initiation, joining in, collective deliberation, sharing responsibility, etc.

And one bonus principle, whose necessity he’s unsure of:

Protection: Precious objects call for protection, appreciation, and communication of the appreciation.

What I recently noticed here and want to write down is a loose correspondence between these different foundations for morality and some approaches to safe superintelligence:

CEV-maximization corresponds to finding a good enough definition of human welfare that Concern alone suffices for safety.

Corrigibility corresponds to operationalizing some notion of Respect that would alone suffice for safety.

Multi-agent approaches lean in the direction of Cooperation.

Approaches that aim to just solve literally the “superintelligence that doesn’t destroy us” problem, without regard for the cosmic endowment, sometimes look like Protection.

Cullity argues that none of his principles is individually a satisfying foundation for morality, but that all four together (elaborated in certain ways with many caveats) seem adequate (and maybe just the first three). I have a similar intuition about AI safety approaches. I can’t yet make the analogy precise, but I feel worried when I imagine corrigibility alone, CEV alone, bargaining alone (whether causal or acausal), or Earth-as-wildlife-preserve; whereas I feel pretty good imagining a superintelligence that somehow balances all four. I can imagine that one of them might suffice as a foundation for the others, but I think this would be path-dependent at best. I would be excited about work that tries to do for Cullity’s entire framework what CEV does for pure single-agent utilitarianism (namely, make it more coherent and robust and closer to something that could be formally specified).

Useful primitives for incentivizing alignment-relevant metrics without compromising on task performance might include methods like Orthogonal Gradient Descent or Averaged Gradient Episodic Memory, evaluated and published in the setting of continual learning or multi-task learning. Something like “answer questions honestly” could mathematically be thought of as an additional task to learn, rather than as an inductive bias or regularization to incorporate. And I think these two training modifications are quite natural (I just came to essentially the same ideas independently and then thought “if either of these would work then surely the multi-task learning folks would be doing them?” and then I checked and indeed they are). Just some more nifty widgets to add to my/our toolbox.
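For concreteness, the projection step that A-GEM applies (as I understand the published rule; this is a paraphrase, not the authors’ code) looks roughly like this, with the alignment objective playing the role of the reference task:

```python
# A-GEM-style gradient projection (sketch): treat "answer questions honestly"
# as an auxiliary task. If the main-task gradient g conflicts with the
# alignment-task reference gradient g_ref (negative dot product), project
# the conflicting component out of g before taking the step.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def agem_project(g, g_ref):
    conflict = dot(g, g_ref)
    if conflict >= 0:
        return list(g)  # no interference: take the step as-is
    coef = conflict / dot(g_ref, g_ref)
    return [gi - coef * ri for gi, ri in zip(g, g_ref)]

g = [1.0, -1.0]        # hypothetical main-task gradient
g_ref = [0.0, 1.0]     # hypothetical alignment-task gradient
g_proj = agem_project(g, g_ref)  # conflict removed: orthogonal to g_ref
```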

Re: alignment tasks in multi-task settings. I think this makes a lot of sense. Especially in worlds where we have a lot of ML/AI systems doing a bunch of different things, even if they have very different specific tasks, the “library” of alignment objectives is probably widely shared.

What happens if you ask a logical inductor whether π^π^π^π is an integer? What’s your own subjective probability?

This is one question where my odds are pretty far from being coherently Bayesian; I would bet against only up to around 100:1, but I wouldn’t bet in favor until somewhere around 8e7:1.
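For a sense of why this bet is so wild, the size of the tower can at least be estimated in logarithms (the first three levels fit in ordinary floating point; the fourth only via its logarithm):

```python
import math

# Estimate the size of pi^(pi^(pi^pi)) using logarithms only.
pi = math.pi
t2 = pi ** pi                 # pi^pi, about 36.46
t3 = pi ** t2                 # pi^(pi^pi), about 1.34e18, still a float
# t4 = pi ** t3 would overflow; its decimal digit count is t3 * log10(pi),
# on the order of 6.7e17 digits, far beyond any direct numerical check
# of integrality.
digits = t3 * math.log10(pi)
```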

Why is that not Bayesian? The decision to bet is going to include a term for your counterparty’s willingness to bet, which is some evidence.

One way to overcome this in thought experiments is to frame it as a line with no spread and no neutral option. At, say, 150k:1, which side would you bet? Then adjust to find where you’d set a line such that you had to accept a wager on either side.