Let the wonder never fade!
Aspiring alignment researcher with a keen interest in agent foundations. Studying math, physics, theoretical CS (Harvard 2027). Contact me via Discord: dalcy_me, email: dalcy.mail@gmail.com. They / Them, He / Him.
I think that RLHF is reasonably likely to be safer than prompt engineering: RLHF is probably a more powerful technique for eliciting your model’s capabilities than prompt engineering is. And so if you need to make a system which has some particular level of performance, you can probably achieve that level of performance with a less generally capable model if you use RLHF than if you use prompt engineering.
Wait, that doesn’t really follow. RLHF can elicit more capabilities than prompt engineering, yes, but how is that a reason for RLHF being safer than prompt engineering?
Is there a case for AI gain-of-function research?
(Epistemic Status: I don’t endorse this yet, just thinking aloud. Please let me know if you want to act/research based on this idea)
It seems like it should be possible to materialize certain forms of AI alignment failure modes with today’s deep learning algorithms, if we directly optimize for their discovery. For example, training a Gradient Hacker Enzyme.
A possible benefit of this would be that it gives us bits of evidence wrt how such hypothesized risks would actually manifest in real training environments. While the similarities would be limited because the training setups would be optimizing for their discovery, it should at least serve as a good lower bound for the scenarios in which these risks could manifest.
Perhaps having a concrete bound for when dangerous capabilities appear (e.g., an X-parameter model trained on Y modality has a Z chance of forming a gradient hacker) would make it easier for policy folks to push for regulations.
Is AI gain-of-function as dangerous as biotech gain-of-function? Some arguments in favor (of the former being dangerous):
The malicious actor argument is probably stronger for AI gain-of-function.
if someone publicly releases a Gradient Hacker Enzyme, this lowers the resources that would be needed for a malicious actor to develop a misaligned AI (e.g., plugging the misaligned Enzyme into an otherwise benign low-capability training run).
Risky researcher incentive is equally strong.
e.g., a research lab carelessly pursuing gain-of-function research, deliberately starting risky training runs for financial/academic incentives, etc.
Some arguments against:
Accident risks from financial incentives are probably weaker for AI gain-of-function.
The standard gain-of-function risk scenario is: research lab engineers a dangerous pathogen, it accidentally leaks, and a pandemic happens.
I don’t see how these events would happen “accidentally” when dealing with AI programs; e.g., the researcher would have to deliberately cut out parts of the network weights and replace them with the enzyme, which is certainly intentional.
I recently learned about metauni, and it looks amazing. TL;DR, a bunch of researchers give out lectures or seminars on Roblox—Topics include AI alignment/policy, Natural Abstractions, Topos Theory, Singular Learning Theory, etc.
I haven’t actually participated in any of their live events yet and only watched their videos, but they all look really interesting. I’m somewhat surprised that there hasn’t been much discussion about this on LW!
Steven Strogatz mentions:
A few weeks ago, someone was asking about “complex systems”. Is there anything to it, or is it just a buzzword? How does it compare to agent-based modeling, chaos theory, systems science, etc.? This concise survey by Mark Newman answers those questions: arxiv.org/abs/1112.1440
I’ve noticed during my alignment study that just the sheer amount of relevant posts out there is giving me a pretty bad habit of (1) passively engaging with the material and (2) not doing much independent thinking. Just keeping up to date & distilling the stuff in my todo read list takes up most of my time.
I guess the reason I do it is because (at least for me) it takes a ton of mental effort to switch modes between “passive consumption” and “active thinking”:
I noticed this when self-studying math: my subjective experience is that I enjoy both “passively listening to lectures + taking notes” and “solving practice problems”; the problem is that it takes a ton of mental energy to switch between the two equilibria.
(This is actually still a problem—too much wide & passive consumption rather than actively practicing them and solving problems.)
Also relevant is wanting to progress/upskill across as wide a range of subjects as fast as I can, sacrificing mastery for diversity. This probably makes sense to some degree (especially in the sense that having more frames is good), but I think I’m taking this wayyyy too far.
My ratio of opening new links to actually reading them far exceeds 1. This definitely helped me when I was trying to get a rapid overview of the entire field, but now it’s just a bad adaptation + akrasia.
Okay, then, don’t do that! Some directions to move towards:
Independent brainstorming/investigation sessions to form concrete inside views
like the advice from the field guide post, exercises from the MATS model, etc.
Commitment mechanisms, like making regular posts or shortforms (eg)
Awesome post! I broadly agree with most of the points and think hodge-podging would be a fairly valuable agenda to further pursue. Some thoughts:
What could AI alignment look like if we had 6000+ full-time researchers and software developers?
My immediate impression is that, if true, this makes hodge-podging fairly well suited for automation (compared to conceptual/theoretical work, based on reasons laid out here)
But when we assemble the various methods, suddenly that works great because there’s a weird synergy between the different methods.
I agree that most synergies would be positive, but the way it was put in this post seems to imply that they would be sort of unexpected. Isn’t the whole point of having primitives & taxonomizing type signatures to ensure that their composition’s behaviors are predictable and robust?
Perhaps I’m uncertain as to what level of “formalization” hodge-podging would be aiming for. If it’s aiming for a fully mathematically formal characterization of various safety properties (eg PreDCA-style) then sure, it would permit lossless provable guarantees of the properties of its composition, as is the case with cryptographic primitives (there are no unexpected synergies from assembling them).
But if they’re at the ELK/plausible-training-stories level of formalization, I suspect hodge-podging would be less able to make composition guarantees, as the “emergent” features from composing them start to come into the picture. At that point, how can it guarantee that there aren’t any negative synergies the misaligned AI could exploit?
(I might just be totally confused here given that I know approximately nothing about categorical systems theory)
For the next step, I might post a distillation of David Jaz Myers Categorical systems theory which treats dynamic systems and their typed wirings as polymorphic lenses.
Please do!
It seems that the greedy selection algorithms and the AI itself can be trusted with approximately 0% of the alignment work, and approximately 100% of it will need to be done by hand.
...
(in the comment)
even if the value-humans shard is perfect, the AI might just figure out some galaxy-brained merger of it with a bunch of other shards, that makes logical sense to it as an extrapolation, and just override the value-humans shard’s protests.
Would the galaxy-brained merger necessarily be a bad example? (I’m confused as to what you think the “end goal” of alignment should look like—what would even be a good example of a well-compiled value, behaving in totally OOD environments?)
If I understood correctly, the compiled-value is what lets the AI extrapolate its values to OOD environments via the virtue of abstraction (which may look totally alien to us).
But those compiled values should still do well in-distribution, if the greedy training dynamic is still in play!
So, (assuming the AI is trained in the limit using ~infinite data that’s representative of in-distribution human day-to-day-life/examples of “good-values” being executed) the galaxy-brain-compiled-value should still act in accordance to our values in in-distribution contexts, which includes e.g., not killing everyone.
Or maybe I misunderstood. Alternatively, perhaps your galaxy-brain example is set at a point where the GPS has already overpowered SGD and can pursue whatever mesa-objective it formed earlier on. Value-compilation would be part of it, since it would’ve been an adaptive strategy during the greedy training process (i.e., better generalization capability to score well on in-distribution training data).
But even in this case, I think it’s still plausible that the GPS will value-compile and abstractize its values in a way that respects in-distribution behaviors (especially since that’s what it had to do during training!)
So overall, I guess my questions are:
What even would a good OOD extrapolation of human values look like?
(relatedly) Why do you think the galaxy-brained merger is bad?
Is my understanding: “GPS’s tendency to value-compile initially forms when SGD is still a dominant force in the training dynamic” correct?
And when SGD loses control and GPS’s value-compilation becomes the dominant force, how would that value-compilation generalize? Would it do so in a way that respects earlier-in-training in-distribution samples? (like not killing humans)
Nice review!
Eliciting the latent knowledge “are you planning to kill us?” goes a long way
I know this is non-central, but just wanted to point out that this would miss 100% of the thoughts it does not think—like how I don’t think/deliberately plan to step on bugs when I accidentally do so.
People mean different things when they say “values” (object vs meta values)
I noticed that people often mean different things when they say “values,” and they end up talking past each other (or convergence only happens after a long discussion). One of the differences is in whether they contain meta-level values.
Some people refer to the “object-level” preferences that we hold.
Often people bring up the “beauty” of the human mind’s capacity for its values to change, evolve, adapt, and grow—changing its mind as it learns more about the world, being open to persuasion via rational argumentation, changing moral theories, etc.
Some people include the meta-values (that are defined on top of other values, and the evolution of such values).
e.g., My “values” include my meta-values, like wanting to be persuaded by good arguments, wanting to change my moral theories when I get to know better, even “not wanting my values to be fixed”
example of this view: carado’s post on you want what you want, and one of Vanessa Kosoy’s shortforms/comments (can’t remember the link)
It seems like retrieval-based transformers like RETRO are “obviously” the way to go—(1) there’s just no need to store all the factual information as fixed weights, and (2) they use far fewer parameters/less memory. Maybe mechanistic interpretability should start paying more attention to these types of architectures, especially since they’re probably going to be a more relevant form of architecture.
They might also be easier to interpret thanks to specialization!
A useful perspective when thinking about mechanistic pictures of agent/value development is to take the “perspective” of different optimizers, consider their relative “power,” and look at how they interact with each other.
E.g., early on, SGD is the dominant optimizer, which is greedy and has direct access to feedback from U. Later on, an early proto-GPS (general-purpose search) forms, which is less greedy but can still largely be swayed by SGD (such as having its problem-specification-input tweaked, having the overall GPS-implementation modified, etc.). Much later, GPS becomes the dominant optimizing force “at run-time,” which shortens the relevant timescale and lets us ignore SGD’s effect. This becomes much more pronounced after reflectivity + gradient hacking, when the GPS’s optimization target becomes fixed.
(very much inspired by reading Thane Ruthenis’s value formation post)
This is a very useful approximation at the late-stage when the GPS self-modifies the agent in pursuit of its objective! Rather than having to meticulously think about local SGD gradient incentives and such, since GPS is non-greedy, we can directly model it as doing what’s obviously rational from a birds-eye-perspective.
(kinda similar to e.g., separation of timescale when analyzing dynamical systems)
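To spell out the analogy (my own illustration, not from the original post): in a standard fast-slow system
$$\dot{x} = f(x, y), \qquad \epsilon\,\dot{y} = g(x, y), \qquad 0 < \epsilon \ll 1,$$
one analyzes the fast variable $y$ while treating the slow variable $x$ as effectively frozen. Late in training, the run-time GPS plays the role of the fast variable and the SGD-driven weight updates play the role of the slow one, so we can analyze the GPS while holding the weights fixed.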
I think the argument beyond 5D hinges on the fact that Bs and Gs will be represented in the WM such that the GPS can take them as part of the problem specification.
Arguments in favor:
GPS has been part of the heuristics (shards), so it needs to be able to use their communication channel. This implies that the GPS reverse-engineered the heuristics. Since GPS has write access to the WM, that implies the Bs and Gs might be included there.
Once included in the WM, it isn’t too hard for the SGD to point the GPS towards it. At that point, there’s a positive feedback loop that incentivizes both (1) pointing GPS even more towards Bs and Gs and (2) making B/G representation even more explicit in the WM.
Arguments against:
By the point 5D happens, GPS should already be well-developed and part of the heuristics, which means they would be a very good approximation of U. This implies strong gradient starvation, so there just might not be any incentive for SGD to do any of this.
If GPS becomes critically reflective before sufficient B/G representation in the WM, then gradient hacking locks in the heuristic-driven-GPS forever.
So, it’s either (1) they do get represented and the arguments after 5D hold, or (2) they don’t get represented and the heuristics end up dominating—basically the shard theory picture.
I think this might be one of the lines that divide your model from the Shard Theory model, and as of now I’m very uncertain as to which picture is more likely.
My argument is that they wouldn’t actually be a good cross-context approximation of U; in part because of gradient starvation.
Ah bad phrasing—where you quoted me (arguments against part) I meant to say:
Heuristic-driven-GPS is a very good approximation of U only within in-distribution context
… and this is happening at a phase where SGD is still the dominant force
… and the Heuristic-driven-GPS is doing all this stuff without being explicitly aimed towards Bs and Gs, but rather the GPS is just part of the “implicit” M → A procedures/modules
… therefore “gradient starvation would imply SGD won’t have incentives to represent Bs and Gs as part of the WM”
(I’m intentionally ignoring value compilation and focusing on 5D because (1) having Bs and Gs represented seems like a necessary precursor for all the stuff that comes after, which makes it a useful part to zoom in on, and (2) I haven’t really settled my thoughts/confusions on value-compilation)
Do my arguments in favor/against in the original comment capture your thoughts on how likely it is that Bs and Gs get represented in the WM? And is your positive conclusion because any one of them seems more likely to matter?
(Quality: Low, only read when you have nothing better to do—also not much citing)
30-minute high-LLM-temp stream-of-consciousness on “How do we make mechanistic interpretability work for non-transformers, or just any architectures?”
We want a general way to reverse engineer circuits
e.g., Should be able to rediscover properties we discovered from transformers
Concrete Example: we spent a bunch of effort reverse engineering transformer-type architectures—then boom, suddenly some parallel-GPU-friendly-LSTM architecture turns out to have better scaling properties, and everyone starts using it. LSTMs have different inductive biases, like things in the same layer being able to communicate multiple times with each other (unlike transformers), which incentivizes e.g., reusing components (more search-y?).
Formalize:
You have task X. You train a model A with inductive bias I_A. You also train a model B with inductive bias I_B. Your mechanistic interpretability techniques work well on deciphering A, but not B. You want your mechanistic interpretability techniques to work well for B, too.
Proposal: Communication channel
Train a Transformer on task X
Existing Mechanistic interpretability work does well on interpreting this architecture
Somehow stitch the LSTM to the transformer (?)
I’m trying to get at the idea of “interface conversion”: by virtue of SGD being greedy, it will try to convert the LSTM’s outputs into transformer-friendly types (rough code sketch below)
Now you can better understand the intermediate outputs of the LSTM by just running mechanistic interpretability on the transformer layers whose input are from the LSTM
(I don’t know if I’m making any sense here, my LLM temp is > 1)
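A minimal PyTorch sketch of the stitching idea (all module names, dimensions, and the linear adapter here are made up for illustration; I'm not claiming this exact setup works, just gesturing at the shape of it):

```python
import torch
import torch.nn as nn

class StitchedModel(nn.Module):
    """Frozen LSTM (the model we want to interpret) feeding into transformer
    layers through a small trainable adapter ("interface conversion")."""

    def __init__(self, vocab_size=1000, lstm_dim=256, tf_dim=512, n_tf_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, lstm_dim)
        self.lstm = nn.LSTM(lstm_dim, lstm_dim, batch_first=True)
        # Trainable adapter: maps LSTM hidden states into the transformer's
        # residual-stream basis.
        self.adapter = nn.Linear(lstm_dim, tf_dim)
        layer = nn.TransformerEncoderLayer(d_model=tf_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_tf_layers)
        self.head = nn.Linear(tf_dim, vocab_size)

    def forward(self, tokens):
        h, _ = self.lstm(self.embed(tokens))    # (batch, seq, lstm_dim)
        z = self.adapter(h)                     # (batch, seq, tf_dim)
        return self.head(self.transformer(z))   # (batch, seq, vocab)

# Freeze the LSTM and train only the adapter + transformer on the same task X,
# then point existing transformer-interpretability tooling at the transformer half.
model = StitchedModel()
for p in list(model.embed.parameters()) + list(model.lstm.parameters()):
    p.requires_grad = False
logits = model(torch.randint(0, 1000, (4, 32)))
print(logits.shape)  # torch.Size([4, 32, 1000])
```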
Proposal: approximation via large models?
Train a larger transformer architecture to approximate the smaller LSTM model (either just input output pairs, or intermediate features, or intermediate features across multiple time-steps, etc):
the basic idea is that a smaller model would be more subject to following its natural gradient shaped by its inductive bias, while a larger model (with direct access to the intermediate outputs of the smaller model) would be able to approximate it despite not having as much inductive-bias incentive towards it (toy sketch below).
probably false but illustrative example: Train a small LSTM on chess. By virtue of being able to run serial computation on the same layers, it focuses on algorithms that have repeating modular parts. In contrast, a small Transformer would learn algorithms that don’t have such repeating modular parts. But instead, train a large transformer to “approximate” the small LSTM—it should be able to do so by, e.g., inefficiently having identical modules across multiple layers. Now use mechanistic interpretability on that.
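A toy sketch of the approximation idea—distilling the small LSTM's intermediate features into a bigger transformer (all dimensions and the linear readout are arbitrary stand-ins, not a claim about what would actually work):

```python
import torch
import torch.nn as nn

# Small LSTM whose intermediate features we want to approximate
# (stands in for the pre-trained "chess" model).
small_lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)

big_tf = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True),
    num_layers=6,
)
readout = nn.Linear(32, 64)  # map transformer stream onto the LSTM's hidden size

opt = torch.optim.Adam(list(big_tf.parameters()) + list(readout.parameters()), lr=1e-3)
for step in range(100):
    x = torch.randn(8, 16, 32)            # stand-in for task-X inputs
    with torch.no_grad():
        target_h, _ = small_lstm(x)       # the LSTM's intermediate features
    pred_h = readout(big_tf(x))           # the transformer's approximation
    loss = nn.functional.mse_loss(pred_h, target_h)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Then run mechanistic interpretability on big_tf rather than on the LSTM.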
Proposal: redirect GPS?
Thane’s value formation picture says GPS should be incentivized to reverse-engineer the heuristics because it has access to inter-heuristic communication channel. Maybe, in the middle of training, gradually swap different parts of the model with those that have different inductive biases, see GPS gradually learn to reverse-engineer those, and mechanistically-interpret how GPS exactly does that, and reimplement in human code?
Proposal: Interpretability techniques based on behavioral constraints
e.g., Discovering Latent Knowledge without Supervision, putting constraints?
How do we “back out” inductive biases, just given e.g., architecture and training setup? What is the type signature?
(I need to read more literature)
Okay, more questions incoming: “Why would GPS be okay with value-compilation, when its expected outcome is to not satisfy in-distribution context behaviors through big-brain moves?”
If I understood correctly (can be skipped; not relevant to my argument, which starts after the bullet points):
Early in training, GPS is part of the heuristic/shard implementation (accessed via API calls)
Middle in training, there is some SGD-incentive towards pointing GPS in the direction of “reverse-engineering heuristic/shard firing patterns, representing them (Gs) as WM variables, and pursuing them directly”
Later in training, there is some SGD-incentive towards pointing GPS to do “value-compilation” by merging/representing more abstract i.e. “compiled” versions of the previous goal pointers and directly pursuing them.
In other words, there are some GPS API calls that say “do value compilation over GPS targets”, and some (previously formed) GPS API calls that say “pursue this reverse-engineered-heuristic-goal”
Very late (approx SGD influence ~ 0 due to starvation + hacking, where we can abstractize SGD as being extensions of GPS), everything is locked-in
Tangent: we can effectively model the AI as subagents, with each subagent being a GPS API-call. It wouldn’t be exactly right to call them “shards,” and it wouldn’t be right to call them “GPS” either (because GPS is presumably a single module; maybe not?). A new slick terminology might be needed … I’ll just call them “Subagents” for now.
To rephrase my question, why would the other (more early-formed) GPS API calls be okay with the API calls for value-compilation?
As you mentioned in a different comment thread, there is no reason for the GPS to obey in-distribution behavior (inner misalignment). So, from the perspective of a GPS that’s API-called with pursuing a reverse-engineered-heuristic-goal, it would think:
“hm, I notice that in the WM & past firing patterns, several goal-variables are being merged/abstractized and end up not obeying the expected in-distribution context behaviors, presumably due to some different GPS-API-call. Well, I don’t want that happening to my values—I need to somehow gradient hack in order to prevent that from happening!”
I think this depends on the optimization “power” distribution between different GPS API-calls (tangent: how is it possible for them to have different power when the GPS, presumably, is a modular thing and the only difference is in how they’re called? Whatever). Only if the API call for value compilation overwhelms the incentive against value compilation from the rest of the API calls (which all of them have an incentive for doing so, and would probably collude) then can value compilation actually proceed—which seems pretty unlikely.
Given this analysis, it seems like the default behavior is for the GPS API-calls to gradient hack away whatever other API-calls that would predictably result in in-distribution behaviors not getting preserved (e.g., value-compilation).
Alternate phrasing: The “Subagents” (check earlier bullet point for terminology) will have an incentive themselves to solve inner misalignment. Therefore, “Subagents” like “Value-compilation-GPS-API-call” that may do dangerous “big-brain” moves are naturally the enemy of everyone else, and will be gradient-hacked away from existence.
Is there any particular reason to believe the GPS API-calls-for-value-compilation would be so strongly chiseled in by the SGD (when SGD still has influence) as to overwhelm all the other API-calls (when SGD stops mattering)?
Wait, so PreDCA solves inner-misalignment by just … assuming that “we will later have an ideal learning theory with provable guarantees”?
By the claim “PreDCA solves inner-misalignment” as implied by the original protocol / distillation posts, I thought it somehow overcame the core problem of demons-from-imperfect-search. But it seems like the protocol already starts with an assumption of “demons-from-imperfect-search won’t be a problem because of amazing theory” and instead tackles a special instantiation of inner-misalignment that happens because of the nature of the protocol itself (i.e. simulation hypotheses due to simplicity bias + assuming an ideal/perfect search or learning theory).
If my understanding is correct, I think the implication regarding inner-misalignment is misleading, because PreDCA is operating at a whole different level of abstraction/problem-level than most of the discourse around inner-misalignment
Different GPS instances aren’t exactly “subagents”, they’re more like planning processes tasked to solve a given problem.
You’re right that GPS-instances (nice term btw) aren’t necessarily subagents—I missed that your GPS formalization does argmin over WM variable for a specific t, not all t, which means it doesn’t have to care about controlling variables at all time.
With that said …
(tangent: I’m still unsure as to whether that is the right formalization for GPS—but I don’t have a better alternative yet)
… there seems to be a selection effect where GPS-instances that don’t care about preserving future API-call-context get removed, leaving only subagent-y GPS-instances over time.
An example of such property would be having a wide range of t values for the problem specification.
Generalizing, when GPS is the dominant force, only GPS-instances that care about preserving call-context survive (and, e.g., surgery-away all the non-preserving ones), and then we can model the AI as being composed of actual subagents.
this sounds like a fairly binary property—either you care (and hence participate in inter-subagent game theory) or you don’t (and disappear).
Now, there are three important questions: (1) does greedy SGD incentivize “care” for preserving future call-contexts? (2) if so, what are the late-stage dynamics? (3) and for which GPS-instances do “care” occur?
What would make it so that SGD chisels in GPS API-calls that “care” about preserving future call-contexts?
SGD-dominant training phase: Well, by accident. Possibility:
certain heuristics may require setting the GPS t value across a very wide range, accidentally making it “care” about preservation more so than others.
SGD-GPS-transition phase: GPS starts to do some very crude gradient hacking. It’s really crude, so there’s some SGD incentive to point the GPS API-calls for all heuristics to strategize against it.
now, all heuristics are slowly becoming more preserving.
GPS-dominant phase: Those that care about preservation survive, and surgery away the non-caring API-calls.
What matters in the dynamics at the GPS-dominant phase? Ideas:
“Relative power” matters the most
e.g., even if some GPS-instance cared about its goals for all t, if powerless, other more-powerful-and-less-caring instances just won’t give it a say.
“Range of care” matters
e.g., some instances care for t:0~10. some care for t:80~95. some care for t:37~999. they might trade to give different instances more “power” for different times as to be pareto-efficient in some way.
note: again, not sure this is the right ontology to think about GPS calls (like implicitly using real-valued range of continuous t values)
What kind of GPS-instances are more likely to “care” more?
e.g., for value-compilation to have the kind of impact described in the post, it should probably (1) have sufficient “power” as to not get ignored in the trading, and (2) have an extremely wide range of “care.” it’s not obvious that this is the case.
But it’s not obvious that we want maximum value compilation. Maximum value compilation leads to e. g. an AGI with a value-humans shard who decides to do a galaxy-brained merger of that shard with something else and ends up indifferent to human welfare.
Nice insight! Perhaps we get a range of diverse moral philosophies by tweaking this single variable “value-compilation-degree,” with deontology on the one end and galaxy-brained merger on the other end.
Combining with the above three questions, I think a fairly good research direction would be to (1) formalize what it means for the GPS-instances to have more “power” or have a “range of care,” (2) and how to tweak/nudge these values for GPS-instances of our choosing (in order to e.g., tweak the degree of value compilation).
Quick thoughts on my plans:
I want to focus on having a better mechanistic picture of agent value formation & distinguishing between hypotheses (e.g., shard theory, Thane Ruthenis’s value-compilation hypothesis, etc) and forming my own.
I think I have a specific but very high uncertainty baseline model of what-to-expect from agent value-formation using greedy search optimization. It’s probably time to allocate more resources on reducing that uncertainty by touching reality i.e. running experiments.
(and also think about related theoretical arguments like Selection Theorem)
So I’ll probably allocate my research time:
Studying math (more linear algebra / dynamical systems / causal inference / statistical mechanics)
Sketching a better picture of agent development, assigning confidence, proposing high-bit experiments (that might have the side-effect of distinguishing between different conflicting pictures), formalization, etc.
and read relevant literature (e.g., ones on theoretical DL and inductive biases)
Upskilling mechanistic interpretability to actually start running quick experiments
Unguided research brainstorming (e.g., going through various alignment exercises, having a writeup of random related ideas, etc)
Possibly participate in programs like MATS? Probably the biggest benefit to me would be (1) commitment mechanism / additional motivation and (2) high-value conversations with other researchers.
Dunno, sounds pretty reasonable!
Random alignment-related idea: train and investigate a “Gradient Hacker Enzyme”
TL;DR, Use meta-learning methods like MAML to train a network submodule i.e. circuit that would resist gradient updates in a wide variety of contexts (various architectures, hyperparameters, modality, etc), and use mechanistic interpretability to see how it works.
It should be possible to have a training setup for goals other than “resist gradient updates,” such as restricting the meta-objective to a specific sub-sub-circuit. In that case, the outer circuit might (1) instrumentally resist updates, or (2) somehow get modified while keeping its original behavioral objective intact.
This setup doesn’t have to be restricted to circuits of course; there was previous work which did this on the level of activations, although iiuc the model found a trivial solution by exploiting relu—it would be interesting to extend this to more diverse setups.
Anyways, varying this “sub-sub-circuit/activation-to-be-preserved” over different meta-learning episodes would incentivize the training process to find “general” Gradient Hacker designs that aren’t specific to a particular circuit/activation—a potential precursor for various forms of advanced Gradient Hackers (and some loose analogies to how enzymes accelerate reactions).
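To make the setup concrete, here’s a toy single-inner-step sketch (my own hypothetical code, not from any existing work—a real version would use proper MAML-style unrolled inner loops and vary tasks/architectures across episodes). The meta-objective simply pushes the inner gradient hitting the designated “enzyme” submodule toward zero:

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.enzyme = nn.Linear(16, 16)  # submodule we want to resist updates
        self.rest = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.rest(torch.relu(self.enzyme(x)))

net = Net()
meta_opt = torch.optim.Adam(net.parameters(), lr=1e-3)
enzyme_ids = {id(p) for p in net.enzyme.parameters()}

for episode in range(200):
    # Inner "episode": one plain SGD gradient computation on a freshly drawn task.
    x, y = torch.randn(64, 16), torch.randn(64, 1)
    loss = nn.functional.mse_loss(net(x), y)
    grads = torch.autograd.grad(loss, list(net.parameters()), create_graph=True)

    # Meta-objective: make the inner gradient on the enzyme vanish, i.e. the
    # enzyme's weights should barely move under ordinary training.
    meta_loss = sum(
        g.pow(2).sum() for g, p in zip(grads, net.parameters()) if id(p) in enzyme_ids
    )
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()
```

Note that this toy version can easily find degenerate solutions (e.g., the rest of the network learning to ignore the enzyme’s output entirely), which is exactly the kind of trivial relu-style exploit mentioned above—part of why varying the setup across episodes matters.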
What is the Theory of Impact for training a “Gradient Hacker Enzyme”?
(note: while I think these are valid, they’re generated post-hoc and don’t reflect the actual process for me coming up with this idea)
Estimating the lower-bound for the emergence of Gradient Hackers.
By varying the meta-learning setups we can get an empirical estimate for the conditions in which Gradient Hackers are possible.
Perhaps gradient hackers are actually trivial to construct using tricks we haven’t thought of before (like the relu example before). Maybe not! Perhaps they require [high-model-complexity/certain-modality/reflective-agent/etc].
Why lower-bound? In a real training environment, gradient hackers appear because of (presumably) convergent training incentives. Instead in the meta-learning setup, we’re directly optimizing for gradient hackers.
Mechanistically understanding how Gradient Hackers work.
Applying mechanistic interpretability here might not be too difficult, because the circuit is cleanly separated from the rest of the model.
There have been several speculations on how such circuits might emerge. Testing them empirically sounds like a good idea!
This is just a random idea and I’m probably not going to work on it; but if you’re interested, let me know. While I don’t think this is capabilities-relevant, this probably falls under AI gain-of-function research and should be done with caution.