Peter S. Park

Karma: 132

Peter S. Park 8 Jul 2022 2:03 UTC
2 points
0
in reply to: Charlie Steiner’s comment on: Race Along Rashomon Ridge
Thanks so much, Charlie, for reading the post and for your comment! I really appreciate it.

I think both ways to prune neurons and ways to make the neural net more sparse are very promising steps towards constructing a simultaneously optimal and interpretable model.

I completely agree that alignment of the neuron basis with human-interpretable classifications of the data would really help interpretability. But if only a subset of the neuron basis is aligned with human-interpretable classifications, and the complement comprises a very large subset of abstractions (which, necessarily, people would not be able to learn to interpret), then we haven’t made the model interpretable.
Suppose 100% is the level of interpretability we need for guaranteed alignment (which I am convinced of, because even 1% uninterpretability can screw you over). Then low-dimensionality seems like a necessary, but not sufficient, condition for interpretability. It is possible, but not always true, that each of a small number of abstractions will either already be familiar to people or can be learned by people in a reasonable amount of time.

Peter S. Park 8 Jul 2022 2:17 UTC
3 points
0
in reply to: harfe’s comment on: Race Along Rashomon Ridge
Thank you so much for this suggestion, tgb and harfe! I completely agree, and this was entirely my error in our team’s collaborative post. The fact that the level sets of submersions are nice submanifolds has nothing to do with the level set of global minimizers.
I think we will be revising this post in the near future reflecting this and other errors.
(For example, the Hessian tells you which directions have zero second-order penalty to loss, but it doesn’t necessarily tell you about higher-order penalties to loss, which is something I forgot to mention. A direction that looks like zero-loss when looking at the Hessian may not actually be zero-loss if it applies, say, a fourth-order penalty to the loss. This could only be probed by a tensor of fourth derivatives. But I think a heuristic argument suggests that a zero-eigenvalue direction of the Hessian should almost always be an actual zero-loss direction. Let me know if you buy this!)
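To make the caveat about higher-order penalties concrete, here is a minimal sketch using a toy loss that is purely an assumption for illustration, `loss(x, y) = x**2 + y**4`. It is not the loss from the post. At the origin the Hessian is (numerically) diag(2, 0), so the y-direction looks flat to second order, yet the loss still rises along it as y**4. The helper `second_derivative` is a hypothetical central-difference estimator, not a library function.

```python
def loss(x, y):
    # Toy loss (an assumption for illustration): quadratic in x, quartic in y.
    return x**2 + y**4


def second_derivative(f, point, i, j, h=1e-3):
    """Central-difference estimate of d^2 f / (dx_i dx_j) at `point`."""
    def shifted(di, dj):
        p = list(point)
        p[i] += di * h
        p[j] += dj * h
        return f(*p)
    return (shifted(1, 1) - shifted(1, -1)
            - shifted(-1, 1) + shifted(-1, -1)) / (4 * h * h)


origin = (0.0, 0.0)
# 2x2 numerical Hessian at the global minimizer (0, 0).
hessian = [[second_derivative(loss, origin, i, j) for j in range(2)]
           for i in range(2)]

# The y-direction has (approximately) zero curvature, but moving along it
# still costs loss, because the penalty is fourth-order.
loss_along_y = loss(0.0, 0.5)
```

Whether such quartic directions are rare in real loss landscapes is exactly the open heuristic question raised above. The sketch only shows that the Hessian alone cannot rule them out.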

Peter S. Park 8 Jul 2022 2:50 UTC
2 points
1
in reply to: Thomas Larsen’s comment on: Race Along Rashomon Ridge
Edit: Adding a link to “Git Re-Basin: Merging Models modulo Permutation Symmetries,” a relevant paper that has recently been posted on arXiv.
Thank you so much, Thomas and Buck, for reading the post and for your insightful comments!

It is indeed true that some functions have two global minimizers that are not path-connected. Empirically, very overparametrized models which are trained on “non-artificial” datasets (“datasets from nature”?) seem to have a connected Rashomon manifold. It would definitely be helpful to know theoretically why this tends to happen, and when this wouldn’t happen.

One heuristic argument for why two disconnected global minimizers might only happen in “artificial” datasets might go something like this. Given two quantities, one is larger than the other, unless there is a symmetry-based reason why they are actually secretly the same quantity. Under this heuristic, a non-overparametrized model’s loss landscape has a global minimum achieved by precisely one point, and potentially some suboptimal local minima as well. But overparametrizing the model makes the suboptimal local minima not local minima anymore (by making them saddle points?) while the single global minimizer is “stretched out” to a whole submanifold. This “stretching out” is the symmetry; all optimal models on this submanifold are secretly the same.

One situation where this heuristic fails is if there are other types of symmetry, like rotation. Then, applying this move to a global minimizer could get you other global minimizers which are not connected to each other. In this case, “modding out by the symmetry” is not decreasing the dimension, but taking the quotient by the symmetry group which gives you a quotient space of the same dimension. I’m guessing these types of situations are more common in “artificial” datasets which have not modded out all the obvious symmetries yet.
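As a small illustration of a discrete symmetry of this kind, the sketch below uses permutation of hidden units (the symmetry studied in the Git Re-Basin paper) rather than rotation. The tiny tanh network, its weights, and the helper names `mlp` and `permute_hidden` are all assumptions made up for this example. Permuting the hidden units gives a different point in parameter space that computes the same function, and hence has the same loss. The code does not show whether the two points are path-connected within the set of minimizers.

```python
import math


def mlp(x, W1, b1, w2, b2):
    """One-hidden-layer tanh network with a scalar output."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(v * h for v, h in zip(w2, hidden)) + b2


def permute_hidden(W1, b1, w2, perm):
    """Reorder hidden units: rows of W1, entries of b1, entries of w2."""
    return ([W1[p] for p in perm],
            [b1[p] for p in perm],
            [w2[p] for p in perm])


# Hypothetical weights for a 2-input, 3-hidden-unit network.
W1 = [[0.5, -1.0], [1.5, 0.25], [-0.75, 2.0]]
b1 = [0.1, -0.2, 0.3]
w2 = [1.0, -2.0, 0.5]
b2 = 0.05
x = [0.3, -0.7]

pW1, pb1, pw2 = permute_hidden(W1, b1, w2, [2, 0, 1])
original = mlp(x, W1, b1, w2, b2)
permuted = mlp(x, pW1, pb1, pw2, b2)  # same function, different parameters
```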

Peter S. Park 26 Jul 2022 3:39 UTC
4 points
0
in reply to: Charlie Steiner’s comment on: Finding Skeletons on Rashomon Ridge
Thanks so much for your insightful comment, Charlie! I really appreciate it.
I think you totally could do this. Even if it is rare, it can occur with positive probability.
For example, my model of how natural selection (genetic algorithms, not SGD) consistently creates diversity is that with sufficiently many draws of descendents, one of the drawn descendents could have turned off the original model and turned on another model in a way that comprises a neutral drift.

Peter S. Park 13 Aug 2022 21:48 UTC
4 points
2
in reply to: johnswentworth’s comment on: How Do We Align an AGI Without Getting Socially Engineered? (Hint: Box It)
I strongly agree with John that “what we really want to do is to not build a thing which needs to be boxed in the first place.” This is indeed the ultimate security mindset.

I also strongly agree that relying on a “fancy,” multifaceted box that looks secure due to its complexity, but may not be (especially to a superintelligent AGI), is not security mindset.

One definition of security mindset is “suppose that anything that could go wrong, will go wrong.” So, even if we have reason to believe that we’ve achieved an aligned superintelligent AGI, we should have high-quality (not just high-quantity) security failsafes, just in case our knowledge does not generalize to the high-capabilities domain. The failsafes would help us efficiently and vigilantly test whether the AGI is indeed as aligned as we thought. This would be an example of a security mindset against overconfidence in our current assumptions.

Peter S. Park 19 Aug 2022 3:13 UTC
6 points
4
on: Interpretability Tools Are an Attack Channel
This is indeed a vital but underdiscussed problem. My SERI MATS team published a post about a game-theoretic model of alignment where the expected scientific benefit of an interpretability tool can be weighed against its expected cost due to its enabling of AGI escape risks. The expected cost can be reduced by limiting the capabilities of the AGI and by increasing the quality of security, and the expected scientific benefit can be increased by prioritizing informational efficiency of the interpretability tool.
Conditional on an organization being dead set on building a superintelligent AGI (which I would strongly oppose, but may be forced to help align if we cannot dissuade the organization in any way), I think efforts to apply security, alignment, and positive-EV interpretability should be targeted at all capability levels, both high and low. Alignment efforts at high-capability levels run into the issue of heightened AGI escape risk. Alignment efforts at low-capability levels run into the issue that alignment gains, if any, may phase-transition out of existence after the AGI moves into a higher-capability regime. We should try our best at both and hope to get lucky.

Peter S. Park 5 Sep 2022 5:29 UTC
3 points
1
on: Private alignment research sharing and coordination
This is an excellent idea. An encrypted, airgapped, or paper library that coordinates between AI researchers seems crucial for AGI safety.
This is because we should expect in the worst-case scenario that AGI will be trained on the whole Internet, including any online discussion of our interpretability tools, security research, and so on. This is information that the AGI can use against us (e.g., by using our interpretability tools against us, to hack, deceive, or otherwise socially engineer the alignment researchers).
Security through obscurity can buy us more chances at aligning/retraining the AGI before it escapes into the Internet. We should keep our battle plans close to our chest, instead of posting them online for the AGI to see.

Peter S. Park 5 Sep 2022 22:52 UTC
4 points
2
in reply to: porby’s comment on: Private alignment research sharing and coordination
In general, it is much easier to keep potentially concerning material out of the AGI’s training set if it’s already a secret rather than something that’s been published on the Internet. This is because there may be copies, references, and discussions of the material elsewhere in the training set that we fail to catch.

If it’s already posted on the Internet and it’s too late, we should of course still try our best to keep it out of the training set.

As for the question of “should we give up on security after AGI attains high capabilities?” we shouldn’t give up as long as our preparation could non-negligibly increase our probability of escaping doom, even if the probability increase is small. We should always maximize expected utility, even if we are probably doomed.

Peter S. Park 8 Sep 2022 20:57 UTC
2 points
1
in reply to: catubc’s comment on: Can We Align a Self-Improving AGI?
Thank you so much for your kind words! I really appreciate it.
One definition of alignment is: Will the AI do what we want it to do? And as your post compellingly argues, “what we want it to do” is not well-defined, because it is something that a powerful AI could be able to influence. For many settings, using a term that’s less difficult to rigorously pin down, like safe AI, trustworthy AI, or corrigible AI, could have better utility.
I would definitely count the AI’s drive towards self-improvement as a part of the College Kid Problem! Sorry if the post did not make that clear.

Peter S. Park 14 Sep 2022 19:25 UTC
2 points
−7
on: The Defender’s Advantage of Interpretability
Thank you so much, Marius, for writing this pertinent post! The question of whether a given interpretability tool will help us or hurt us in expectation is an extremely important one.
The answer, however, differs on a situation-to-situation basis. The scientific benefit of an interpretability tool (more generally, of any information channel) is difficult to estimate a priori, but likely is tied to its informational efficiency. Roughly speaking, how much informational value can the interpretability tool/information channel yield per unit of bitrate?
The cost of an interpretability tool includes the risk of capabilities gains (if AI capabilities orgs know about it). It also includes the risk that a superintelligent AGI can use the interpretability tool/information channel for its own purposes, such as by hacking or socially engineering the alignment researchers (i.e., probably to escape into the Internet and then power-seek).
This perspective suggests that an interpretability tool/information channel will be disadvantageous for the defender if (1) the informational efficiency of the channel is low, (2) the interpreted AI has high capabilities (or the interpretability tool helps capabilities organizations increase the model’s capabilities), and (3) its box setup has weak cybersecurity.
Keeping interpretability research off the Internet (such as via an encrypted or airgapped library that coordinates between AI researchers) is a low-hanging fruit that would make dual-use interpretability tools more likely to be helpful to humanity rather than counterproductive.
To put it bluntly, once an interpretability tool (the “battle plan”) is posted online, we should assume in the worst-case scenario that the battle plan has just been given to the superintelligent AGI. First, the AGI will be misaligned in the worst-case scenario, and this scenario may in fact occur with high probability. Second, the likely misaligned AGI will probably be trained on the whole Internet, and thus will already know how the tool works, what we expect to see from it, and how to manipulate the tool for the AGI’s own purposes.
As for the six proposed arguments why “interpretability has a defenders advantage,” I am uncertain that arguments 1-5 will hold true for the interpretability tools that will be ready by the time AGI emerges. I agree with argument 6, but in a war against misalignment, I wouldn’t unconditionally assume that any battle plan which cannot guarantee a complete victory is a bad plan. In fact, a plan that tries to guarantee a complete victory but makes the outcome worse in expectation is probably a bad plan. In contrast, a plan that cannot guarantee a complete victory but makes the outcome better in expectation is probably a good plan.
Personally, I am most optimistic about interpretability tools of high informational efficiency that have not yet been posted on the Internet. We should try really hard to develop and use high-quality interpretability tools. However, using a high quantity of low-quality (or publicly posted) interpretability tools may actually decrease the odds of human survival.

Peter S. Park 3 Nov 2022 22:45 UTC
1 point
0
in reply to: Charlie Steiner’s comment on: Why do we post our AI safety plans on the Internet?
Thanks so much for your helpful comment, Charlie! I really appreciate it.
It is likely that our cruxes are the following. I think that (1) we probably cannot predict the precise moment the AGI becomes agentic and/or dangerous, (2) we probably won’t have a strong credence that a specific alignment plan will succeed, and (3) AGI takeoff will be slow enough that secrecy can be a key difference-maker in whether we die or not.
So, I expect we will have alignment plan numbers 1, 2, 3, and so on. We will try alignment plan 1, but it will probably not succeed (and hopefully we can see signs of it not succeeding early enough that we shut it down and try alignment plan 2). If we can safely empirically iterate, we will find an alignment plan N that works.
This is risky and we could very well die (although I think the probability is not unconditionally 100%). This is why I think not building AGI is by far the best strategy (Corollary: I place a lot of comparative optimism on AI governance and coordination efforts). The above discussion is conditional on trying to build an aligned AGI.
I think with extensive discussion, planning, and execution, we can have a Manhattan-Project-esque shift in research norms that maintains much of the ease-of-research for us AI safety researchers, but achieves secrecy and thereby preserves the value of our AI safety plans. If this can be achieved at a not-too-high resource cost, I think it is likely a good idea, and I think there is at least a small probability that it will “result in an x-risk reduction that is, on a per-dollar level, maximal among past and current EA projects on x-risk reduction.”

Peter S. Park 5 Nov 2022 22:57 UTC
1 point
1
in reply to: Zach Stein-Perlman’s comment on: Instead of technical research, more people should focus on buying time
Some practical ideas of how to achieve this (and a productive debate in the comments section of the risks from low-quality outreach efforts) can be found in my related forum post from earlier: https://forum.effectivealtruism.org/posts/juhMehg89FrLX9pTj/a-grand-strategy-to-recruit-ai-capabilities-researchers-into

Peter S. Park 6 Nov 2022 0:07 UTC
2 points
1
in reply to: Emrik’s comment on: Instead of technical research, more people should focus on buying time
Please check out my writeup from April! https://forum.effectivealtruism.org/posts/juhMehg89FrLX9pTj/a-grand-strategy-to-recruit-ai-capabilities-researchers-into

Peter S. Park 6 Nov 2022 1:26 UTC
1 point
0
in reply to: Emrik’s comment on: Instead of technical research, more people should focus on buying time
That’s totally fair!

The part of my post I meant to highlight was the last sentence: “To put it bluntly, we should—on all fronts—scale up efforts to recruit talented AI capabilities researchers into AI safety research, in order to slow down the former in comparison to the latter. ”

Perhaps I should have made this point front-and-center.

Peter S. Park 6 Nov 2022 1:53 UTC
1 point
0
in reply to: Emrik’s comment on: Instead of technical research, more people should focus on buying time
I think the point of Thomas, Akash, and Olivia’s post is that more people should focus on buying time, because solving the AI safety/alignment problem before capabilities increase to the point of AGI is important, and right now the latter is progressing much faster than the former.
See the first two paragraphs of my post, although I could have made its point and the implicit modeling assumptions more explicit:
“AI capabilities research seems to be substantially outpacing AI safety research. It is most likely true that successfully solving the AI alignment problem before the successful development of AGI is critical for the continued survival and thriving of humanity.
Assuming that AI capabilities research continues to outpace AI safety research, the former will eventually result in the most negative externality in history: a significant risk of human extinction. Despite this, a free-rider problem causes AI capabilities research to myopically push forward, both because of market competition and great power competition (e.g., U.S. and China). AI capabilities research is thus analogous to the societal production and usage of fossil fuels, and AI safety research is analogous to green-energy research. We want to scale up and accelerate green-energy research as soon as possible, so that we can halt the negative externalities of fossil fuel use.”
If the “multiplier effects” framing helped you update, then that’s really great! (I also found this framing helpful when I wrote it this summer at SERI MATS, in the Alignment Game Tree group exercise for John Wentworth’s stream.)
I do think that in order for the “multiplier effects” explanation to hold, it needs to slow down capabilities research relative to safety research. Doing the latter with maximum efficiency is the core phenomenon that proves the optimality of the proposed action, not the former.

Peter S. Park 17 Nov 2022 8:32 UTC
4 points
3
in reply to: Erik Jenner’s comment on: The limited upside of interpretability
Thank you so much, Erik, for your detailed and honest feedback! I really appreciate it.
I agree with you that it is obviously true that we won’t be able to make detailed predictions about what an AGI will do without running it. In other words, the most efficient source of information will be empiricism in the precise deployment environment. The AI safety plans that are likely to robustly help alignment research will be those that make empiricism less dangerous for AGI-scale models. Think BSL-4 labs for dangerous virology experiments, which would be analogous to airgapping, sandboxing, and other AI control methods.
I am not completely pessimistic about interpretability of coarse-grained information, although still somewhat pessimistic. Even in systems neuroscience, interpretability of coarse-grained information has seen some successes (in contrast to interpretability of fine-grained information, which has seen very little success).
I agree that if the interpretability researcher is extremely lucky, they can extract facts about the AI that let them make important coarse-grained predictions with only a small amount of time and computational resources.
But as you said, this is an unrealistically optimistic picture. More realistically, the interpretability researcher will not be magically lucky, which means we should expect the rate at which prediction-enhancing information is obtained to be inefficient.
And given that information channels are dual-use (in that the AGI can also use them for sandbox escape), we should prioritize efficient information channels like empiricism, rather than inefficient ones like fine-grained interpretability. Inefficient information channels can be net-negative, because they may be more useful for the AGI’s sandbox escape compared to their usefulness to alignment researchers.
Perhaps to demonstrate that this is a practical concern rather than just a theoretical concern, let me ask the following. In your model, why did the Human Brain Project crash and burn? Should we expect interpreting AGI-scale neural nets to succeed where interpreting biological brains failed?

Peter S. Park 17 Nov 2022 11:29 UTC
3 points
2
on: Current themes in mechanistic interpretability research
Thank you very much for the detailed and insightful post, Lee, Sid, and Beren! I really appreciate it.
In the spirit of full communication, I’m writing to share my recent argument that mechanistic interpretability may not be a reliable safety plan for AGI-scale models.
It would be really helpful to hear your thoughts on it!

Peter S. Park 19 Nov 2022 8:47 UTC
2 points
1
in reply to: beren’s comment on: The limited upside of interpretability
Thank you so much for your insightful and detailed response, Beren! I really appreciate your time.
The cruxes seem very important to investigate.
This seems especially likely to me if the AGIs architecture is hand-designed by humans – i.e. there is a ‘world model’ part and a ‘planner’ part and a ‘value function’ and so forth.
It probably helps to have the AGI’s architecture hand-designed to be more human-interpretable. My model is that on the spectrum of high-complexity paradigms (e.g., deep learning) to low-complexity paradigms (e.g., software design by a human software engineer), having the AGI’s architecture be hand-designed moves away from the former and towards the latter, which helps reduce computational irreducibility and thereby increase out-of-distribution predictability (e.g., on questions like “Is the model deceptive?”).
However, my guess is that in order for out-of-distribution predictability of the system to be nontrivial, one would need to go substantially towards the low-complexity end of the spectrum. This would make it unlikely for the model to achieve AGI-like capabilities.
What we ultimately likely want is a statistical-mechanics-like theory of how do neural nets learn representations which includes what circuits/specific computations they tend to do, how they evolve during training, what behaviours these give rise to, and how they behave off distribution etc.
It would be great if we can get a statistical-mechanics-like theory. Realistically, such a theory would probably be a combination of “small-to-middle-number systems because local interactions are quite relevant in affecting system behavior. In this case, there would be a general lack of the homogeneous mixing necessary for a purely statistical mechanics view to be applicable.”
Interpretability tools could let us perform highly targeted interventions on the system without needing to understand the full system.
I can’t rule it out (gene editing is the optimistic outside view here), but I expect such interventions to be imperfect and vulnerable to unforeseeable side effects. These interventions would likely require immense trial-and-error to achieve nontrivial out-of-distribution predictability.
From my perspective, as someone who has worked in neuroscience is that the field just keeps chugging along accumulating knowledge and is advancing at a reasonable pace – i.e. our understanding of the brain has improved vastly since 20 years ago. Not at the pace probably required for alignment on short timelines but at a decent level for a scientific field.
I agree with the characterization “not at the pace probably required for alignment on short timelines but at a decent level for a scientific field.” Indeed, the main point I wanted to make was my skepticism that interpretability research on AGI-scale models will progress fast enough to solve alignment in a short amount of time.
I think just relying on airgapping and other physical measures is a bad idea vs a superintelligence and in practice we should be trying to build up as many uncorrelated lines of defense as possible in the test-harness.
I completely agree, as long as the lines of defense are net-positive rather than net-negative. An information channel should only be opened if it helps the alignment researchers (to align the AGI) more than it helps the likely by-default misaligned AGI (to prematurely escape its sandbox).
This is an interesting objection. While it is possible that the AGI could hack interpretability channels, it seems much more difficult than just communicating in the normal way by running in the world – i.e. the ‘empirical channel’. This is because it probably doesn’t know (and we should be careful to make sure that it definitely does not know) what interpretability tools we are using on it, and that interpretability tools can be applied before we actually run the AGI (i.e. directly to weights) or to individual components in isolation.
My guess is that the reward-risk tradeoff of empirical channels—empiricism in the deployment environment or a precise replica of the deployment environment—is more likely to be favorable than that of interpretability channels. This is because the reward is high (empiricism efficiently yields predictive information) while ways to mitigate the risk are human-understandable (as opposed to “complexity is the enemy of security”).
If the AGI doesn’t have prior knowledge of our interpretability plans and how to exploit them (e.g., by deceiving us into a false sense of security), then I would be much more optimistic about the reward-risk tradeoff of interpretability channels! This is why we should keep our interpretability plans inaccessible to the AGI as much as possible. Since there is plausibly a significant amount of secrecy value in interpretability plans, we should:
1. preemptively keep our interpretability plans off of the Internet,
2. filter these plans from the AGI’s training set, and
3. ensure that the AGI cannot inspect its own internal data.
I have been campaigning for these measures, but many AI safety researchers have generally opposed adopting Measure #1. I hope this will change going forward!

Peter S. Park 28 Nov 2022 22:27 UTC
2 points
1
in reply to: Esben Kran’s comment on: The limited upside of interpretability
Thank you so much for your detailed and insightful response, Esben! It is extremely informative and helpful.
So as far as I understand your text, you argue that fine-grained interpretability loses out against “empiricism” (running the model) because of computational intractability.
I generally disagree with this. beren points out many of the same critiques of this piece as I would come forth with. Additionally, the arguments seem too undefined, like there is not in-depth argumentation enough to support the points you make. Strong upvote for writing them out, though!
The main benefit of interpretability, if it can succeed, is that one can predict harmful future behavior (that would have occurred when deployed out-of-distribution) by probing internal data. This allows the researchers to preemptively prevent the harmful behavior: for example, by retraining after detecting deceptive intent. If this is scientifically possible, it would be a substantial benefit, especially since it is generally difficult to obtain out-of-distribution predictions from atheoretical empiricism.
However, I am skeptical that interpretability can achieve nontrivial success in out-of-distribution predictions, especially in the amount of time alignment researchers will realistically have. The reason is that deceptive intent is likely a fine-grained trait at the internal-data level (rather than at the behavioral level). Consequently, computational irreducibility is likely to impose a hard bound on predicting deceptive intent out-of-distribution, at least when assuming realistic amounts of time and resources.
My guess is that detecting deceptive intent solely from a neural net’s internal data is probably at least as fine-grained as behavioral genetics or neuroscience. These fields have made some progress, but preemptively predicting behavioral traits from internal data remains mostly unsolved.
For example, consider a question analogous to that of deceptive misalignment: ‘Is the given genome optimized for inclusive fitness, or is it optimized for a proxy goal that deviates from inclusive fitness in certain historically unprecedented environments?’ We know that evolutionary pressures select for maximizing inclusive fitness. However, the genome is optimized not for inclusive fitness, but for a proxy goal (survive and engage in sexual intercourse) that deviates from inclusive fitness in environments that are sufficiently distinct from ancestral environments.
How did scientists find out that the genome is optimized for a proxy goal? Almost entirely from behavior. We have a coarse-grained behavioral model that is quite good and generalizable. Evolution shaped animals’ behavior towards a drive for sexual intercourse, but historically unprecedented environmental changes (e.g., widespread availability of birth control) have made this proxy goal distinct from inclusive fitness. Parsimonious models based on first principles that are likely to be correct, like the above one, have a realistic chance of achieving situation-specific predictability that generalizes out-of-distribution.
In contrast, there is still very little understanding of which genes interact to cause animals’ sex drive. Which genes affect sex drive? Probably a substantial proportion of them, and they probably interact in interconnected and nonlinear ways (including with the extremely complex, multidimensionally varying environment) to produce behavioral traits in an unpredictable manner. Moreover, a lot of the information needed to predict behavioral traits like sex drive will lie in the specific environment and how it interacts with the genome. Only the most coarse-grained of these interaction dynamics will be predictable via bypassing empiricism with a statistical-mechanics-like model, due to computational irreducibility. And such a coarse-grained model will likely be rooted in behavior-based abstractions.
Deep-learning neural nets do come with an advantage lacked by behavioral genetics and neuroscience: a potentially complete knowledge of the internal data, the environmental data, and the data of their interaction throughout the whole training process.
But there is a missing piece: complete knowledge of the deployment environment. Any internals-based model of deceptive intent that alignment researchers can come up with is only guaranteed to hold in the subset of environments that the researchers have empirically tested. In the subset of environments that the polycausal model has not been tested in, there is no a priori reason that the model will generalize correctly. A barrier to generalizability is posed by the nonlinear and interconnected interactions between the neural net’s internals and the unprecedented environment, which can and likely will manifest differently depending on the environment. Relaxed adversarial training can help test a wider variety of environments, but this is still hampered by the blind spot of being unable to test the subset of environments that cannot be instantiated at human-level capabilities (e.g., the environment in which RSA encryption is broken). Thus, my guess is that the intrinsic out-of-distribution predictability of the AGI neural net’s behavior would be low, just like that of behavioral genetics or neuroscience.
For a conceptual example, consider the fact that the dynamics of cellular automata can change drastically with just one cell’s change in the initial conditions. See Figure 1 of Beckage et al. (Code 1599 in Wolfram’s A New Kind of Science).
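The sensitivity to a single cell can be explored with a short simulation. This is a minimal sketch, not a reproduction of the cited figure. It assumes that "Code 1599" denotes a three-color, nearest-neighbor totalistic rule whose base-3 digits give the new cell color for each neighborhood sum, with the least significant digit corresponding to sum 0. The grid width, the periodic boundaries, and the position of the changed cell are arbitrary choices made for illustration.

```python
def totalistic_table(code, colors=3):
    """Decode a totalistic code into a table indexed by the 3-cell sum.

    Assumed convention: the base-`colors` digit for sum s (least
    significant digit = sum 0) is the new color of the center cell.
    """
    n_sums = 3 * (colors - 1) + 1
    table = []
    for _ in range(n_sums):
        table.append(code % colors)
        code //= colors
    return table


def step(cells, table):
    """Advance one time step with periodic (wrap-around) boundaries."""
    n = len(cells)
    return [table[cells[i - 1] + cells[i] + cells[(i + 1) % n]]
            for i in range(n)]


def hamming(a, b):
    """Number of positions at which two configurations differ."""
    return sum(x != y for x, y in zip(a, b))


def divergence(width=81, steps=30, code=1599):
    """Hamming distance over time between two runs that start one cell apart."""
    table = totalistic_table(code)
    base = [0] * width
    base[width // 2] = 1
    perturbed = list(base)
    perturbed[width // 2 + 5] = 1  # the single-cell change
    distances = []
    for _ in range(steps + 1):
        distances.append(hamming(base, perturbed))
        base, perturbed = step(base, table), step(perturbed, table)
    return distances


if __name__ == "__main__":
    print(divergence())
```

Running `divergence()` returns the per-step count of differing cells, which can be inspected to see how the one-cell difference spreads through the configuration over time.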
In general, the only way to accurately ascertain how a computationally irreducible agent will behave in a complex environment is to run it in that environment. Even with complete knowledge of the agent’s internals, incomplete knowledge of the environment is sufficient to constrain a priori predictability. I expect that many predictions yielded by interpretability tools in the pre-deployment environment will fail to generalize to the post-deployment environment, unless the two are equal.
It seems premature to disregard the plausibility of the agenda itself, just as it is premature to disregard the project of neuroscience based on HBP. Now, arguing that it’s a matter of speed seems completely fine but this is another argument and isn’t emphasized in the text.
Sorry for the miscommunication! I meant to say that the rate at which mechanistic interpretability will yield useful, generalizable information is slow, not zero.
But this is sufficient for concern because informational channels are dual-use; the AGI can use them for sandbox escape. We should only open an interpretability channel if the rate of scientific benefit exceeds the rate of cost (risk of premature sandbox escape by a misaligned AGI).
My opinion is that while mechanistic interpretability has made some progress, the rate at which this progress is happening is not fast enough to solve alignment with a limited amount of time and computational resources. So far, the rate of progress in interpretability research has been substantially outpaced by that in AI capabilities research. I think this was predictable, due to what we know about computational irreducibility.
Maybe it’s just my misunderstanding of what you mean by fine-grained interpretability, but we don’t need to figure out what neurons do, we literally design them. So the inspections happen at feature level, which is much more high-level than investigating individual neurons (sometimes these features seem represented in singular neurons of course). The circuits paradigm also generally looks at neural networks like systems neuroscience does, interpreting causal pathways in the models (of course with radical methodological differences because of the architectures and computation medium). The mechanistic interpretability project does not seem misguided by an idealization of neuron-level analysis and will probably adopt any new strategy that seems promising.
Roughly speaking, there is a spectrum between high-complexity paradigms of design (e.g., deep learning) and low-complexity, modular paradigms of design (e.g., software design by a human software engineer). My guess is that for many complex tasks, the optimal equilibrium strategy can be achieved only by the former, and attempting to meaningfully move towards the latter end of the spectrum will result in sacrificing performance. For example, I expect that we won’t be able to build AGI via modular software design by a human software engineer, but that we will be able to build it by deep learning.
Again, thank you for the post and I always like when people cite McElreath, though I don’t see his arguments apply as well to interpretability since we don’t model neural networks with linear regression at all. Not even scaling laws use such simplistic modeling, e.g. see Ethan’s work.
In Ethan’s scaling law, extrapolatory generalization is only guaranteed to be valid locally (“perfectly extrapolate until the next break”), and not globally. This is completely consistent with my prior. My assertion was that in order to globally extrapolate empirical findings to an unknown deployment environment, only simple models have a nontrivial chance of working (assuming realistic amounts of time and computational resources). These simple models will likely be based on parsimonious first principles that we have strong reason to believe are valid even in the unknown environment. And consequently, they will likely be largely based on behavioral data rather than the internal data of the agent-environment interaction dynamics.
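The local-versus-global distinction can be illustrated with a toy curve. The sketch below is not Ethan's scaling law or its fitted parameters: it assumes a generic smoothly broken power law with one break, and every constant (`b`, `c0`, `c1`, `d`, `f`) and every sample point is a hypothetical value chosen for illustration. A plain power law is fit only to points well before the break and then evaluated on both sides of it.

```python
import math


def broken_power_law(x, b=1.0, c0=0.1, c1=0.4, d=1e3, f=0.1):
    """Toy smoothly broken power law with a single break near x = d.

    For x << d the curve behaves like b * x**(-c0); for x >> d the
    exponent steepens to -(c0 + c1). All parameters are hypothetical.
    """
    return b * x ** (-c0) * (1.0 + (x / d) ** (1.0 / f)) ** (-c1 * f)


def fit_power_law(xs, ys):
    """Ordinary least squares on (log x, log y); returns (slope, intercept)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sum(
        (a - mx) ** 2 for a in lx
    )
    return slope, my - slope * mx


def predict(x, slope, intercept):
    """Evaluate the fitted single power law at x."""
    return math.exp(intercept + slope * math.log(x))


# Observations taken only from the regime before the break (x <= 100 << d).
train_x = [1, 2, 5, 10, 20, 50, 100]
train_y = [broken_power_law(x) for x in train_x]
slope, intercept = fit_power_law(train_x, train_y)


def relative_error(x):
    """Relative error of the single-power-law extrapolation at x."""
    true = broken_power_law(x)
    return abs(predict(x, slope, intercept) - true) / true


err_before = relative_error(200.0)  # still before the break
err_after = relative_error(1e5)     # well past the break
print(err_before, err_after)
```

In this toy setting the fit extrapolates almost exactly up to the break and is off by a large factor beyond it, which is the sense in which extrapolation is valid locally but not globally.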

Peter S. Park 28 Nov 2022 22:41 UTC
3 points
2
in reply to: Beth Barnes’s comment on: The limited upside of interpretability
Thank you so much, Beth, for your extremely insightful comment! I really appreciate your time.
I completely agree with everything you said. I agree that “you can make many very useful predictions about humans by modeling them as (for example) having some set of goals, habits, tendencies, and knowledge,” and that these insights will be very useful for alignment research.
I also agree that “it’s difficult to identify what a human’s intentions are just by having access to their brain.” This was actually the main point I wanted to get across; I guess it wasn’t clearly communicated. Sorry about the confusion!
My assertion was that in order to predict the interaction dynamics of a computationally irreducible agent with a complex deployment environment, there are two realistic options:
1. Run the agent in an exact copy of the environment and see what happens.
2. If the deployment environment is unknown, use the available empirical data to develop a simplified model of the system based on parsimonious first principles that are likely to be valid even in the unknown deployment environment. The predictions yielded by such models have a chance of generalizing out-of-distribution, although they will necessarily be limited in scope.
When researchers try to predict intent from internal data, their assumptions/first principles (based on the limited empirical data they have) will probably not be guaranteed to be “valid even in the unknown deployment environment.” Hence, there is little robust reason to believe that the predictions based on these model assumptions will be generalizable out-of-distribution.