# The Pragmascope Idea

Pragma (Greek): thing, object.

A “pragmascope”, then, would be some kind of measurement or visualization device which shows the “things” or “objects” present.

I currently see the pragmascope as *the* major practical objective of work on natural abstractions. As I see it, the core theory of natural abstractions is now 80% nailed down, I’m now working to get it across the theory-practice gap, and the pragmascope is the big milestone on the other side of that gap.

This post introduces the idea of the pragmascope and what it would look like.

## Background: A Measurement Device Requires An Empirical Invariant

First, an aside on developing new measurement devices.

### Why The Thermometer?

What makes a thermometer a good measurement device? Why is “temperature”, as measured by a thermometer, such a useful quantity?

Well, at the most fundamental level… we stick a thermometer in two different things. Then, we put those two things in contact. Whichever one showed a higher “temperature” reading on the thermometer gets colder, whichever one showed a lower “temperature” reading on the thermometer gets hotter, all else equal (i.e. controlling for heat exchanged with other things in the environment). And this is robustly true across a huge range of different things we can stick a thermometer into.

It didn’t have to be that way! We could imagine a world (with very different physics) where, for instance, heat always flows from red objects to blue objects, from blue objects to green objects, and from green objects to red objects. But we don’t see that in practice. Instead, we see that each system can be assigned a single number (“temperature”), and then when we put two things in contact, the higher-number thing gets cooler and the lower-number thing gets hotter, regardless of which two things we picked.

Underlying the usefulness of the thermometer is an *empirical* fact, an *invariant*: the fact that which-thing-gets-hotter and which-thing-gets-colder when putting two things into contact can be predicted from a single one-dimensional real number associated with each system (i.e. “temperature”), for an extremely wide range of real-world things.

Generalizing: a useful measurement device starts with identifying some empirical invariant. There needs to be a wide variety of systems which interact in a predictable way across many contexts, *if* we know some particular information about each system. In the case of the thermometer, a wide variety of systems get hotter/colder when in contact, in a predictable way across many contexts, *if* we know the temperature of each system.
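To make the invariant concrete, here is a minimal Python sketch which checks whether a set of observed heat-flow directions can be explained by one number per object. The function name `scalar_ranking` and the example objects are illustrative assumptions, not part of any existing tool. Each observation is a (hotter, colder) pair, and the function looks for a single ordering consistent with all of them.

```python
from graphlib import TopologicalSorter, CycleError


def scalar_ranking(flows):
    """Return objects ordered hottest-first if one number per object can
    explain every observed (hotter, colder) pair; otherwise return None."""
    graph = {}
    for hot, cold in flows:
        graph.setdefault(hot, set())
        graph.setdefault(cold, set()).add(hot)  # hot must come before cold
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        # A cycle means no single-number "temperature" fits the observations.
        return None


# Observations of the kind we see in practice (illustrative objects).
our_world = [("coffee", "air"), ("air", "ice"), ("coffee", "ice")]
# The imagined world where heat flows red -> blue -> green -> red.
cyclic_world = [("red", "blue"), ("blue", "green"), ("green", "red")]

print(scalar_ranking(our_world))
print(scalar_ranking(cyclic_world))
```

In the cyclic world the function finds no ordering, which is the sense in which a thermometer could not exist there.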

So what would be an analogous empirical invariant for a pragmascope?

### The Role Of The Natural Abstraction Hypothesis

The natural abstraction hypothesis has three components:

1. Chunks of the world generally interact with far-away chunks of the world via relatively-low-dimensional summaries.
2. A broad class of cognitive architectures converge to use subsets of these summaries (i.e. they’re instrumentally convergent).
3. These summaries match human-recognizable “things” or “concepts”.

For purposes of the pragmascope, we’re particularly interested in claim 2: a broad class of cognitive architectures converge to use subsets of the summaries. If true, that sure sounds like an empirical invariant!

So what would a corresponding measurement device look like?

## What Would A Pragmascope Look Like, Concretely?

The “measurement device” (probably a Python function, in practice) should take in some cognitive system (e.g. a trained neural network) and maybe its environment (e.g. simulator/data), and spit out some data structure representing the natural “summaries” in the system/environment. Then, we should easily be able to take some *other* cognitive system trained on the same environment, extract the natural “summaries” from that, and compare. Based on the natural abstraction hypothesis, we expect to observe things like:

- A broad class of cognitive architectures trained on the same data/environment end up with subsets of the same summaries.
- Two systems with the same summaries are able to accurately predict the same things on new data from the same environment/distribution.
- On inspection, the summaries correspond to human-recognizable “things” or “concepts”.
- A system is able to accurately predict things involving the same human-recognizable concepts the pragmascope says it has learned, and cannot accurately predict things involving human-recognizable concepts the pragmascope says it has not learned.

It’s these empirical observations which, if true, will underpin the usefulness of the pragmascope. The more precisely and robustly these sorts of properties hold, the more useful the pragmascope. Ideally we’d even be able to *prove* some of them.
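As a rough illustration of the interface described above, here is a hedged Python sketch. Everything in it is hypothetical: the `Pragmascope` type, the `compare_systems` helper, and especially the choice to model summaries as a set of labels, which is only a placeholder since the real output data structure is still open. The `toy_scope` function is a stand-in that reads labels off a toy dictionary; it does not inspect any real network.

```python
from typing import Any, Callable, Dict, FrozenSet, Hashable

# ASSUMPTION: summaries are modeled as a set of hashable labels. The real
# output data structure is an open question; this is only a placeholder.
Summaries = FrozenSet[Hashable]

# Hypothetical interface: (cognitive system, environment) -> summaries.
Pragmascope = Callable[[Any, Any], Summaries]


def compare_systems(scope: Pragmascope, system_a: Any, system_b: Any,
                    environment: Any) -> Dict[str, Summaries]:
    """Extract summaries from two systems and report shared vs. unique ones."""
    a = scope(system_a, environment)
    b = scope(system_b, environment)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


def toy_scope(system: Any, environment: Any) -> Summaries:
    """Stand-in 'pragmascope' that just reads labels stored on a toy system."""
    return frozenset(system["concepts"])


net_a = {"concepts": ["tree", "dog", "edge"]}
net_b = {"concepts": ["tree", "car", "edge"]}
report = compare_systems(toy_scope, net_a, net_b, environment=None)
print(report)
```

The point of the sketch is only the shape of the workflow: run the same device on two systems trained on the same environment, then compare the outputs.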

### What’s The Output Data Structure?

One obvious currently-underspecified piece of the picture: what data structures will the pragmascope output, to represent the “summaries”? I have some current-best-guesses based on the math, but the main answer at this point is “I don’t know yet”. I expect looking at the internals of trained neural networks will give lots of feedback about what the natural data structures are.

Probably the earliest empirical work will just punt on standard data structures, and instead focus on translating internal-concept-representations in one net into corresponding internal-concept-representations in another. For instance, here’s one experiment I recently proposed:

- Train two nets, with different architectures (both capable of achieving zero training loss and good performance on the test set), on the same data.
- Compute the small change in data $dx$ which would induce a small change in trained parameter values $d\theta$ along each of the narrowest directions of the ridge in the loss landscape (i.e. eigenvectors of the Hessian with largest eigenvalue).
- Then, compute the small change in parameter values $d\theta$ in the *second* net which would result from the same small change in data $dx$.
- Prediction: the $d\theta$ directions computed will approximately match the narrowest directions of the ridge in the loss landscape of the second net.

Conceptually, this sort of experiment is intended to take all the stuff one network learned, and compare it to all the stuff the other network learned. It wouldn’t yield a full pragmascope, because it wouldn’t say anything about how to factor all the stuff a network learns into individual concepts, but it would give a very well-grounded starting point for translating stuff-in-one-net into stuff-in-another-net (to first/second-order approximation).
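The experiment leaves open how to score “approximately match”. One possible operationalization, offered as an assumption rather than as part of the proposal, is the fraction of a parameter change's squared norm that lies inside the span of the second net's top-$k$ Hessian eigenvectors. The helper name `fraction_in_subspace` and the choice of $k$ are hypothetical.

```python
import numpy as np


def fraction_in_subspace(v, basis):
    """Fraction of the squared norm of v lying in the span of basis columns.

    v:     parameter change, shape (P,)
    basis: candidate 'narrow' directions as columns, shape (P, k),
           assumed linearly independent
    """
    q, _ = np.linalg.qr(basis)   # orthonormal basis for the same span
    projected = q @ (q.T @ v)    # orthogonal projection of v onto the span
    return float(projected @ projected) / float(v @ v)


# Toy check in 3 dimensions: the subspace is the plane of the first two axes.
plane = np.eye(3)[:, :2]
print(fraction_in_subspace(np.array([1.0, 0.0, 0.0]), plane))
print(fraction_in_subspace(np.array([1.0, 0.0, 1.0]), plane))
```

A score near 1 would count as a match and a score near the chance level $k/P$ would not. Where to draw the line is a judgment call this sketch does not settle.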

Question 1: What’s the minimal set of articles one should read to understand this 80%?

Question/Remark 2: AFAICT, your theory has a major missing piece, which is proving that “abstraction” (formalized according to your way of formalizing it) is actually a crucial ingredient of learning/cognition. The way I see it, such a proof should be by demonstrating that hypothesis classes defined in terms of probabilistic graph models / abstraction hierarchies can be learned with good sample complexity (and better yet if you can tell something about the computational complexity), in a manner that cannot be achieved if you discard any of the important-according-to-you pieces. You might have some different approach to this, but I’m not sure what it is.

Telephone Theorem, Redundancy/Resampling, and Maxent for the math, Chaos for the concepts.

If we want to show that abstraction is a crucial ingredient of learning/cognition, then “Can we efficiently learn hypothesis classes defined in terms of abstraction hierarchies, as captured by John’s formalism?” is entirely the wrong question. Just because something *can* be learned efficiently doesn’t mean it’s convergent for a wide variety of cognitive systems. And even if such hypothesis classes couldn’t be learned efficiently in full generality, it would still be possible for a subset of that hypothesis class to be convergent for a wide variety of cognitive systems, in which case general properties of the hypothesis class would still apply to those systems’ cognition.

The question we actually want here is “Is abstraction, as captured by John’s formalism, instrumentally convergent for a wide variety of cognitive systems?”. And that question is indeed not yet definitively answered. The pragmascope itself would largely allow us to answer that question empirically, and I expect the ability to answer it empirically will quickly lead to proofs as well.

Thank you!

I believe that the relevant cognitive systems all look like learning algorithms for a prior of a certain fairly specific type. I don’t know what this prior looks like, but it’s something very rich on the one hand and efficiently learnable on the other hand. So, if you showed that your formalism naturally produces priors that seem closer to that “holy grail prior”, in terms of richness/efficiency, compared to priors that we already know (e.g. MDPs with a small number of states, which are not rich enough, or the Solomonoff prior, which is both statistically and computationally intractable), that would at least be *evidence* that you’re going in the right direction.

Hmm, I’m not sure what it would mean for a subset of a hypothesis class to be “convergent”.

That’s interesting, but I’m still not sure what it means exactly. Let’s say we take a reinforcement learner with a specific hypothesis class, such as all MDPs of a certain size, or some family of MDPs with low eluder dimension, or the actual AIXI. How would you determine whether your formalism is “instrumentally convergent” for each of those? Is there a rigorous way to state the question?

Doesn’t the necessity of abstraction follow from size concerns? The alternative to abstraction would be to measure and simulate everything in full detail, which can only be done if you are “exponentially bigger than the universe” (and have exponentially many universes to learn from).

One could argue that *some kind of* abstraction is necessary due to size concerns, but that alone does not necessarily nail down my whole formalism.

From the existing theory, I still have a hard time seeing what you would be able to get out of this in practice, without adding further structures. That’s not to say I think the idea is doomed, I have some ideas for what I’d attempt to do as further structures (like existing human geometric knowledge), but it doesn’t seem like you plan on using those.

I’m not sure to what extent my confusion is because you still plan to learn new things when experimenting with networks as you apply this pragmascope, vs you being more optimistic about what the existing theory allows you to do. Vs other things I might not realize.

I definitely expect to learn a lot from networks when experimenting.

Okay, really rough idea on how to identify where an ML model’s goals are stored + measure how much of an optimizer it is. If successful, it might provide a decent starting point for disentangling concepts from each other.

The Ground of Optimization mentions “retargetability” as one of the variables of optimizing systems. How much of the system do you need to change in order to make it optimize towards a different target configuration? Can you easily split the system into the optimizer and the optimized? For example: In a house-plus-construction-company system, we just need to vary the house’s schematics to make the system optimize towards wildly different houses. Conversely, to make a ball placed at the edge of a giant inverted cone come to rest in a different location, we’d need to change the shape of the entire cone.

Intuitively, it seems like it should be possible to identify goals in neural networks the same way. A “goal” is the minimal set of parameters that you need to perturb in order to make the network optimize a meaningfully different metric without any loss of capability.

- Various shallow pattern-matchers/look-up tables are not easily retargetable: you’d need to rewrite most of their parameters. They’re more like inverted cones.
- Idealized mesa-optimizers with a centralized crystallized mesa-objective are very retargetable: their utility function is precisely mathematically defined, disentangled from capabilities, and straightforwardly rewritten.
- Intermediate systems (e.g. shard economies/heuristics over world-models) are *somewhat* retargetable. There may be limited dimensions along which their mesa-objectives may be changed without capability loss, limited “angles” in concept-space by which their targeting may be adjusted. Alternatively/additionally, you’d need to rewrite the entire suite of shards/heuristics at once and in a cross-dependent manner.

As a bonus, the fraction of parameters you need to change to retarget the system roughly tells you how much of an optimizer it is.

The question is how to implement this. It’s easy to imagine the algorithms that may work if we had infinite compute, but practically?

Neuron Shapleys may be a good starting point? The linked paper seems to “use the Shapley value framework to measure the importance of different neurons in determining an arbitrary metric of the neural net output”, and the authors use it to tank accuracy/remove social bias/increase robustness to adversarial attacks just by rewriting a few neurons. It might be possible to do something similar to detect goal-encoding neurons? Haven’t looked into it in-depth yet, though.

Neat idea. One thing I’d watch out for is that “subset of the neurons” might not be the right ontology for a conceptually-“small” change. E.g. in the ROME paper, they made low-rank updates rather than working with individual neurons. So bear in mind that figuring out the ontology through which to view the network’s internals may itself be part of the problem.

Can you unroll that?

“Small change in data” = one additional training sample is slightly modified? “Induce” = via an SGD update step on that additional training sample? Why is there a ridge in the loss landscape? What are “the narrowest directions”?

The easiest operationalization starts from the assumption that we train to zero loss. From there, we can calculate the small change in optimal parameter values dθ due to a small change in all the data dx:

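$$\left(-\sum_n \frac{df_n}{d\theta} \frac{d^2 L_n}{(df_n)^2} \frac{df_n}{d\theta}\right) d\theta = \sum_n \frac{df_n}{d\theta} \frac{d^2 L_n}{(df_n)^2} \frac{df_n}{dx_n} dx_n$$

… where:

- $f_n(\theta, x_n)$ is the network output on datapoint $n$
- $L_n(f_n)$ is the loss on datapoint $n$

(More generally, when calculating $\max_\theta u(\theta, x)$, the change in optimal $\theta$-value from a small change in $x$ is given by $\frac{d^2 u}{d\theta^2} d\theta = -\frac{d^2 u}{d\theta\, dx} dx$.)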
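For anyone who wants the intermediate step: the general formula comes from differentiating the first-order optimality condition. The derivation below is a sketch under two assumptions not spelled out above: $u$ is twice differentiable with an interior optimum $\theta^*(x)$, and “zero loss” means each per-datapoint loss sits at its own minimum, so $\frac{dL_n}{df_n} = 0$.

$$
\frac{du}{d\theta}\big(\theta^*(x), x\big) = 0 \ \text{ for all } x
\quad\Longrightarrow\quad
\frac{d^2 u}{d\theta^2}\, d\theta + \frac{d^2 u}{d\theta\, dx}\, dx = 0
$$

Taking $u(\theta, x) = -\sum_n L_n(f_n(\theta, x_n))$, every term containing a second derivative of $f_n$ is multiplied by $\frac{dL_n}{df_n} = 0$ and drops out, leaving

$$
\frac{d^2 u}{d\theta^2} = -\sum_n \frac{df_n}{d\theta} \frac{d^2 L_n}{(df_n)^2} \frac{df_n}{d\theta},
\qquad
\frac{d^2 u}{d\theta\, dx_n} = -\frac{df_n}{d\theta} \frac{d^2 L_n}{(df_n)^2} \frac{df_n}{dx_n}
$$

Substituting these into $\frac{d^2 u}{d\theta^2} d\theta = -\frac{d^2 u}{d\theta\, dx} dx$ and summing over datapoints recovers the training-specific formula, including the minus sign on its left-hand side.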

The “narrowest directions” are the eigenvectors of the loss Hessian with largest eigenvalue (where the loss Hessian is $\sum_n \frac{df_n}{d\theta} \frac{d^2 L_n}{(df_n)^2} \frac{df_n}{d\theta}$, i.e. the matrix on the LHS in the formula above, up to the leading minus sign). And there’s a ridge in the loss landscape because, if we’re training to zero loss, then presumably we’re in the overparameterized regime.
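Here is a minimal NumPy sketch of that calculation on a toy problem. The setup is an assumption chosen for illustration: a model linear in its parameters, $f(\theta, x) = \sum_k \theta_k x^k$, with 8 parameters and 4 datapoints, and squared loss, so that $\frac{d^2 L_n}{(df_n)^2} = 1$. Because the Hessian is singular along the ridge, the sketch solves for $d\theta$ only within the non-flat eigendirections (the minimum-norm solution), which is one choice among several. The threshold `1e-10` and all function names are likewise illustrative.

```python
import numpy as np


def features(x, n_params):
    """Polynomial features [1, x, x^2, ...]; the model is features(x) @ theta."""
    return x[:, None] ** np.arange(n_params)


def dfeatures_dx(x, n_params):
    """Derivative of each polynomial feature with respect to x."""
    k = np.arange(n_params)
    return k * x[:, None] ** np.clip(k - 1, 0, None)


N, P = 4, 8                                  # fewer datapoints than parameters
x = np.linspace(-1.0, 1.0, N)
y = np.sin(2.0 * x)
theta = np.linalg.pinv(features(x, P)) @ y   # an interpolating (zero-loss) solution

J_theta = features(x, P)                     # df_n/dtheta, shape (N, P)
J_x = dfeatures_dx(x, P) @ theta             # df_n/dx_n, shape (N,)
L_ff = np.ones(N)                            # d^2 L_n/(df_n)^2 for 0.5*(f_n - y_n)^2

H = J_theta.T @ (L_ff[:, None] * J_theta)    # loss Hessian, shape (P, P)
C = J_theta.T * (L_ff * J_x)                 # column n: df_n/dtheta * L_ff[n] * df_n/dx_n

eigvals, eigvecs = np.linalg.eigh(H)         # eigenvalues in ascending order
keep = eigvals > 1e-10 * eigvals.max()       # drop the flat (ridge) directions
narrow_vals, narrow_dirs = eigvals[keep], eigvecs[:, keep]


def dtheta_from_dx(dx):
    """Solve (-H) dtheta = C dx, restricted to the non-flat directions."""
    rhs = C @ dx
    return -narrow_dirs @ ((narrow_dirs.T @ rhs) / narrow_vals)


dx = 1e-6 * np.array([1.0, -2.0, 0.5, 1.5])
dtheta = dtheta_from_dx(dx)
before = np.max(np.abs(features(x + dx, P) @ theta - y))
after = np.max(np.abs(features(x + dx, P) @ (theta + dtheta) - y))
print(f"max residual without update: {before:.2e}, with update: {after:.2e}")
```

The last lines are a sanity check: after perturbing the data, applying the computed $d\theta$ should bring the network back to (approximately) zero loss, with the remaining error second-order in $dx$. For a real network, `J_theta` and `J_x` would come from automatic differentiation rather than closed-form features.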