Let the wonder never fade!
Aspiring alignment researcher with a keen interest in agent foundations. Studying math, physics, theoretical CS (Harvard 2027). Contact me via Discord: dalcy_me, email: dalcy.mail@gmail.com. They / Them, He / Him.
What are the errors in this essay? As I’m reading through the Brain-like AGI sequence I keep seeing this post being referenced (but this post says I should instead read the sequence!)
I would really like to have a single reference post of yours that contains the core ideas about phasic dopamine, rather than the reference being the sequence posts (which depend heavily on a bunch of previous posts; also, Posts 5 and 6 feel more high-level than this one?)
gwern’s take on a similar paper (TinyStories), in case anyone was wondering. Notable part for me:
...
Now, what would be really interesting is if they could go beyond the in-domain tasks and show something like meta-learning. That’s supposed to be driven by the distribution and variety of Internet-scale datasets, and thus should not be elicited by densely sampling a domain like this.
moments of microscopic fun encountered while studying/researching:
Quantum mechanics calls a vector a “ket” and its dual a “bra” because … bra-c-ket. What can I say, I like it. But where did the letter ‘c’ go, Dirac?
Defining Cauchy sequences and limits in real analysis: it’s really cool how you “bootstrap” the definitions of Cauchy sequences / limits on the reals using the definitions of Cauchy sequences / limits on the rationals. basically:
(1) define Cauchy sequence on rationals
(2) use it to define limit (on rationals) using rational-Cauchy
(3) use it to define reals
(4) use it to define Cauchy sequence on reals
(5) show it’s consistent with Cauchy sequence on rationals in both directions
a. rationals are embedded in the reals, hence the real-Cauchy definition subsumes the rational-Cauchy definition
b. for any positive real there is a smaller positive rational, hence a sequence being rational-Cauchy means it is also real-Cauchy
(6) define limit (on reals)
(7) show it’s consistent with limit on rationals
(8) … and that they’re equivalent to real-Cauchy
(9) proceed to ignore the distinction b/w real-Cauchy/limit and their rational counterpart. Slick!
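To make step (5) concrete, here are the two definitions from (1) and (4) side by side (standard statements, written from memory, nothing specific to any one textbook):

```latex
\text{rational-Cauchy: } \forall \varepsilon \in \mathbb{Q}_{>0},\ \exists N,\ \forall m,n \ge N:\ |a_m - a_n| < \varepsilon
\text{real-Cauchy: } \forall \varepsilon \in \mathbb{R}_{>0},\ \exists N,\ \forall m,n \ge N:\ |a_m - a_n| < \varepsilon
```

Direction (5b) goes through because for any real $\varepsilon > 0$ there is a rational $\varepsilon'$ with $0 < \varepsilon' < \varepsilon$, so checking all rational tolerances already suffices.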
(will probably keep updating this in the replies)
Any advice on reducing neck and shoulder pain while studying? For me that’s my biggest blocker to being able to focus longer (especially for math, where I have to look down at my notes/book for a long period of time). I’m considering stuff like getting a standing desk or doing regular back/shoulder exercises. Would like to hear what everyone else’s setups are.
Just wanted to comment that this is an absolutely amazing resource and has saved me a ton of time trying to get into this field & better understand several of the core papers. Thank you so much for writing this!
(Quality: low—only read when you have nothing better to do; also, not many citations)
30-minute high-LLM-temp stream-of-consciousness on “How do we make mechanistic interpretability work for non-transformers, or just any architectures?”
We want a general way to reverse engineer circuits
e.g., Should be able to rediscover properties we discovered from transformers
Concrete Example: we spent a bunch of effort reverse-engineering transformer-type architectures—then boom, suddenly some parallel-GPU-friendly-LSTM architecture turns out to have better scaling properties, and everyone starts using it. LSTMs have different inductive biases, like things in the same layer being able to communicate with each other multiple times (unlike transformers), which incentivizes e.g., reusing components (more search-y?).
Formalize:
You have task X. You train a model A with inductive bias I_A. You also train a model B with inductive bias I_B. Your mechanistic interpretability techniques work well on deciphering A, but not B. You want your mechanistic interpretability techniques to work well for B, too.
Proposal: Communication channel
Train a Transformer on task X
Existing Mechanistic interpretability work does well on interpreting this architecture
Somehow stitch the LSTM to the transformer (?)
I’m trying to get at the idea of “interface conversion”: that by virtue of SGD being greedy, it will try to convert the LSTM’s outputs into transformer-friendly types
Now you can better understand the intermediate outputs of the LSTM by just running mechanistic interpretability on the transformer layers whose inputs come from the LSTM (rough sketch below)
(I don’t know if I’m making any sense here, my LLM temp is > 1)
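A very rough sketch of what I mean, in case it helps; everything concrete here (the single linear adapter, freezing both models, all dimensions) is my assumption, not a worked-out method:

```python
# Hypothetical "stitching" sketch: run a frozen LSTM, translate its hidden
# states into a frozen (already well-interpreted) transformer via a small
# trainable adapter, then inspect the transformer's activations as a window
# into what the LSTM computed. All sizes/placements here are made up.
import torch
import torch.nn as nn

d_lstm, d_tf = 64, 256
lstm = nn.LSTM(d_lstm, d_lstm, batch_first=True)        # frozen, trained on task X
tf_layer = nn.TransformerEncoderLayer(d_model=d_tf, nhead=8, batch_first=True)
transformer = nn.TransformerEncoder(tf_layer, num_layers=4)  # frozen, interpreted
adapter = nn.Linear(d_lstm, d_tf)                        # the only trainable piece

for p in list(lstm.parameters()) + list(transformer.parameters()):
    p.requires_grad_(False)

x = torch.randn(8, 32, d_lstm)       # stand-in batch of input features
h_lstm, _ = lstm(x)                  # LSTM intermediate outputs
h_tf = transformer(adapter(h_lstm))  # run existing mech-interp tooling on these
# (training the adapter, e.g., to preserve task behavior, is omitted here)
```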
Proposal: approximation via large models?
Train a larger transformer architecture to approximate the smaller LSTM model (either on just input-output pairs, or intermediate features, or intermediate features across multiple time-steps, etc.; rough sketch at the end of this proposal):
the basic idea is that a smaller model would be more subject to following its natural gradient shaped by its inductive bias, while a larger model (with direct access to the intermediate outputs of the smaller model) would be able to approximate it despite not having as much of an inductive-bias incentive towards it.
probably false but illustrative example: Train a small LSTM on chess. By virtue of being able to run serial computation on the same layer, it focuses on algorithms that have repeating modular parts. In contrast, a small transformer would learn algorithms that don’t have such repeating modular parts. So instead, train a large transformer to “approximate” the small LSTM—it should be able to do so by, e.g., inefficiently having identical modules across multiple layers. Now use mechanistic interpretability on that.
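A minimal sketch of the “approximate the smaller model” setup, assuming a PyTorch-style training step; matching intermediate features with an MSE term plus a KL term on outputs is my guess at the objective, and every size here is arbitrary:

```python
# Hypothetical distillation sketch: a large transformer student is trained to
# match a small LSTM teacher's outputs AND intermediate states, so that
# transformer-targeted interpretability can then be run on the student.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallLSTM(nn.Module):
    def __init__(self, vocab=128, d=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.lstm = nn.LSTM(d, d, batch_first=True)
        self.head = nn.Linear(d, vocab)

    def forward(self, x):
        h, _ = self.lstm(self.emb(x))   # h: the intermediate states to match
        return self.head(h), h

class LargeTransformer(nn.Module):
    def __init__(self, vocab=128, d=256, teacher_d=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(d, vocab)
        self.probe = nn.Linear(d, teacher_d)  # map student features to teacher space

    def forward(self, x):
        h = self.enc(self.emb(x))
        return self.head(h), self.probe(h)

teacher, student = SmallLSTM(), LargeTransformer()
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
x = torch.randint(0, 128, (8, 32))      # stand-in batch of token ids

with torch.no_grad():
    t_logits, t_h = teacher(x)          # teacher provides the targets
s_logits, s_h = student(x)
loss = F.mse_loss(s_h, t_h) + F.kl_div(
    s_logits.log_softmax(-1), t_logits.softmax(-1), reduction="batchmean")
loss.backward()
opt.step()
```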
Proposal: redirect GPS?
Thane’s value-formation picture says the GPS should be incentivized to reverse-engineer the heuristics because it has access to the inter-heuristic communication channel. Maybe, in the middle of training, gradually swap out different parts of the model for ones with different inductive biases, watch the GPS gradually learn to reverse-engineer those, mechanistically interpret how exactly the GPS does that, and reimplement it in human code?
Proposal: Interpretability techniques based on behavioral constraints
e.g., Discovering Latent Knowledge without Supervision, putting constraints?
How do we “back out” inductive biases, given just e.g., the architecture and training setup? What is the type signature?
(I need to read more literature)
Bella is meeting a psychotherapist, but they treat her fear as something irrational. This doesn’t help, and only makes Bella more anxious. She feels like even her therapist doesn’t understand her.
How would one find a therapist in their local area who’s aware of what’s going on in EA/rat circles, such that they wouldn’t treat statements about, say, x-risks as schizophrenic/paranoid?
Awesome post! I broadly agree with most of the points and think hodge-podging would be a fairly valuable agenda to further pursue. Some thoughts:
What could AI alignment look like if we had 6000+ full-time researchers and software developers?
My immediate impression is that, if true, this makes hodge-podging fairly well suited for automation (compared to conceptual/theoretical work, for reasons laid out here)
But when we assemble the various methods, suddenly that works great because there’s a weird synergy between the different methods.
I agree that most synergies would be positive, but the way it was put in this post seems to imply that they would be sort of unexpected. Isn’t the whole point of having primitives & taxonomizing type signatures to ensure that their composition’s behaviors are predictable and robust?
Perhaps I’m uncertain as to what level of “formalization” hodge-podging would be aiming for. If it’s aiming for a fully mathematically formal characterization of various safety properties (eg PreDCA-style) then sure, it would permit lossless provable guarantees of the properties of its composition, as is the case with cryptographic primitives (there are no unexpected synergies from assembling them).
But if they’re at the ELK/plausible-training-stories level of formalization, I suspect hodge-podging would be less able to make composition guarantees, as the “emergent” features from composing them start to come into the picture. At that point, how can it guarantee that there aren’t any negative synergies the misaligned AI could exploit?
(I might just be totally confused here given that I know approximately nothing about categorical systems theory)
For the next step, I might post a distillation of David Jaz Myers’ categorical systems theory, which treats dynamical systems and their typed wirings as polymorphic lenses.
Please do!
If after all that it still sounds completely wack, check the date. Anything from before like 2003 or so is me as a kid, where “kid” is defined as “didn’t find out about heuristics and biases yet”, and sure at that age I was young enough to proclaim AI timelines or whatevs.
https://twitter.com/ESYudkowsky/status/1650180666951352320
Found an example in the wild: mutual information! These equivalent definitions of mutual information undergo concept splintering as you go beyond just 2 variables:
interpretation: common information
… become co-information, the central atom of your I-diagram
interpretation: relative entropy b/w the joint and the product of marginals
… become total-correlation
interpretation: joint entropy minus all unshared info
… become bound information
… each with different properties (eg co-information is a bit too sensitive, because just a single pair being independent reduces the whole thing to 0; total correlation seems to overcount a bit, etc) and so with different uses (eg bound information is interesting for time series).
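For reference, writing the splintering out (standard formulas as I remember them; the sign convention for co-information varies across authors):

```latex
% two variables: three equivalent faces of the same quantity
I(X;Y) = H(X) + H(Y) - H(X,Y)
       = D_{\mathrm{KL}}\!\left(p(x,y)\,\middle\|\,p(x)\,p(y)\right)
       = H(X,Y) - H(X \mid Y) - H(Y \mid X)

% n variables: the faces come apart
\mathrm{Co}(X_1;\dots;X_n) = -\sum_{\emptyset \neq T \subseteq [n]} (-1)^{|T|}\, H(X_T)
    \quad \text{(co-information)}
C(X_1,\dots,X_n) = \sum_i H(X_i) - H(X_1,\dots,X_n)
    = D_{\mathrm{KL}}\!\left(p(x_1,\dots,x_n)\,\middle\|\,\textstyle\prod_i p(x_i)\right)
    \quad \text{(total correlation)}
D(X_1,\dots,X_n) = H(X_1,\dots,X_n) - \sum_i H(X_i \mid X_{[n]\setminus i})
    \quad \text{(dual total correlation / bound information)}
```

Each reduces to $I(X;Y)$ at $n = 2$, which is exactly the splintering point.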
I am very interested in this, especially in the context of alignment research and solving not-yet-understood problems in general. Since I have no strong commitments this month (and was going to do something similar to this anyways), I will try this every day for the next two weeks and report back on how it goes (writing this comment as a commitment mechanism!)
Have a large group of people attempt to practice problems from each domain, randomizing the order that they each tackle the problems in. (The ideal version of this takes a few months)
...
As part of each problem, they do meta-reflection on “how to think better”, aiming specifically to extract general insights and intuitions. They check what processes seemed to actually lead to the answer, even when they switch to a new domain they haven’t studied before.
Within this upper-level feedback loop (at the scale of whole problems, taking hours or days), I’m guessing a lower-level loop would involve something like cognitive strategy tuning to get real-time feedback as you’re solving the problems?
Complaint with Pugh’s real analysis textbook: He doesn’t even define the limit of a function properly?!
It’s implicitly defined together with the definition of continuity, where the condition is $|x - a| < \delta \Rightarrow |f(x) - f(a)| < \varepsilon$, but in Chapter 3, when defining differentiability, he implicitly switches the condition to $0 < |x - a| < \delta$ without even mentioning it (nor the requirement that $a$ now needs to be an accumulation point!). While Pugh has its own benefits, coming from a Terry Tao analysis textbook background, this is absurd!
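Spelled out, the two conditions (as I read them):

```latex
\text{continuity at } a:\quad \forall \varepsilon > 0\ \exists \delta > 0:\ |x - a| < \delta \implies |f(x) - f(a)| < \varepsilon
\text{limit (as differentiability needs it):}\quad \forall \varepsilon > 0\ \exists \delta > 0:\ 0 < |x - a| < \delta \implies |f(x) - L| < \varepsilon
```

The second makes sense at any accumulation point $a$ of the domain, and $L$ need not equal $f(a)$—indeed $f$ need not even be defined at $a$, as with a difference quotient at $t = x$.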
(though to be fair, Terry Tao has the exact same issue in Book 2, where his definition of function continuity via limits in metric spaces precedes that of defining limits in general … the only redeeming factor is that it’s defined rigorously in Book 1, in the limited context of $\mathbb{R}$)
*sigh* I guess we’re still pretty far from reaching the Pareto Frontier of textbook quality, at least in real analysis.
… Speaking of Pareto Frontiers, would anyone say there is such a textbook that is close to that frontier, at least in a different subject? Would love to read one of those.
I used to try out near-random search on ideaspace, where I made a quick app that spat out 3~5 random words from a dictionary of interesting words/concepts that I curated, and I spent 5 minutes every day thinking very hard on whether anything interesting came out of those combinations.
Of course I knew random search on exponential space was futile, but I got a couple cool invention ideas (most of which turned out to already exist), like:
infinite indoor rock climbing: attach rocks to a vertical treadmill, and now you have an infinite indoor rock climbing wall (which is also safe from falling)! maybe add some fancy mechanism to add variations to the rocks + a VR headset, I guess.
clever crypto mechanism design (in the spirit of CO2 Coin) to incentivize crowdsourcing of age-reduction molecule design animal trials from the public. (I know what you’re thinking)
You can probably do this smarter now if you wanted, with eg better GPT models. (sketch of the original dumb version below)
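The app itself was nothing fancy; the whole thing was roughly this (`concepts.txt`, one curated word/concept per line, is a hypothetical path):

```python
# Sketch of the random-combination prompt app: sample 3~5 curated concepts.
import random

with open("concepts.txt") as f:   # hypothetical curated dictionary file
    concepts = [line.strip() for line in f if line.strip()]

print(" + ".join(random.sample(concepts, random.randint(3, 5))))
```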
There’s still some pressure, though. If the bites were permanently not itchy, then I might not have noticed that the mosquitoes were in my room in the first place, and consequently would have been less likely to pursue them directly. I guess that’s just not enough.
Having lived ~19 years, I can distinctly remember around 5~6 times when I explicitly noticed myself experiencing totally new qualia with my inner monologue going “oh wow! I didn’t know this dimension of qualia was a thing.” examples:
hard-to-explain sense that my mind is expanding horizontally with fractal cube-like structures (think bismuth) forming around it and my subjective experience gliding along its surface which lasted for ~5 minutes after taking zolpidem for the first time to sleep (2 days ago)
getting drunk for the first time (half a year ago)
feeling absolutely euphoric after having a cool math insight (a year ago)
...
Reminds me of myself around a decade ago, completely incapable of understanding why my uncle smoked, thinking “huh? The smoke isn’t even sweet, why would you want to do that?” Now that I have [addiction-to-X] as a clear dimension of qualia/experience solidified in myself, I can better model their subjective experience even though I’ve never smoked myself. Reminds me of the SSC classic.
Also, one observation: it feels like the rate at which I acquire these is getting faster, probably because of increased self-awareness + an increased option space as I reach adulthood (like being able to drink).
Anyways, I think it’s really cool, and can’t wait for more.
That means the problem is inherently unsolvable by iteration. “See what goes wrong and fix it” auto-fails if The Client cannot tell that anything is wrong.
Not at all meant to be a general solution to this problem, but I think a specific case where we could turn this into something iterable is by using historical examples of scientific breakthroughs—consider past breakthroughs on a problem where the solution (in hindsight) is overdetermined, train the AI on data filtered by date, and have The Client evaluate the AI solely based on how closely it approaches that overdetermined answer.
As a specific example: imagine feeding the AI the historical context that led up to the development of information theory, and checking whether the AI converges onto something isomorphic to what Shannon found (training with an information cutoff, of course). Information theory surely seems like The Overdetermined Solution for tackling the sorts of problems it was motivated by, and so the job of the client/evaluator becomes much easier.
Of course this is probably still too difficult in practice (eg not enough high-quality historical data of breakthroughs, evaluation & data-curation still demanding great expertise, hope of ”… and now our AI should generalize to genuinely novel problems!” not cashing out, scope of this specific example being too limited, etc).
But the situation for this specific example sounds somewhat better than that laid out in this post, i.e. The Client themselves needing the expertise to evaluate non-hindsight based supposed Alignment breakthroughs & having to operate on completely novel intellectual territory.
Thoughtdump on why I’m interested in computational mechanics:
one concrete application to natural abstractions from here: tl;dr, belief structures generally seem to be fractal-shaped. one major part of natural abstractions is trying to find the correspondence between structures in the environment and concepts used by the mind. so if we can do the inverse of what adam and paul did, i.e. ‘discover’ fractal structures from activations and figure out what stochastic process they might correspond to in the environment, that would be cool
… but i was initially interested in reading compmech stuff not with a particular alignment relevant thread in mind but rather because it seemed broadly similar in directions to natural abstractions.
re: how my focus would differ from my impression of current compmech work done in academia: academia seems faaaaaar less focused on actually trying out epsilon reconstruction on real-world noisy data. CSSR is an example of a reconstruction algorithm. apparently people have done compmech stuff on real-world data—i don’t know how good it is, but effort-wise it seems far less invested in compared to theory work
would be interested in these reconstruction algorithms, eg what are the bottlenecks to scaling them up, etc. (the definition of what they’re estimating is spelled out at the bottom of this list)
tangent: epsilon transducers seem cool. if the reconstruction algorithm is good, a prototypical example i’m thinking of is something like: pick some input-output region within a model, and literally try to discover the hmm reconstructing it? of course it’s gonna be unwieldy and large. but, to shift the thread in the direction of bright-eyed theorizing …
the foundational Calculi of Emergence paper talked about the possibility of hierarchical epsilon machines, where you do epsilon machines on top of epsilon machines. for simple examples where you can analytically do this, you get wild things like more and more compact representations of stochastic processes (eg data stream → tree → markov model → stack automata → … ?)
this … sounds like natural abstractions in its wildest dreams? literally point at some raw datastream and automatically build hierarchical abstractions that get more compact as you go up
haha but alas, (almost) no development afaik since the original paper. seems cool
and also, more tangentially, compmech seems to have a lot to say about providing interesting semantics to various information measures aka True Names, so another angle i was interested in was learning about those.
eg crutchfield talks a lot about developing the right notion of information flow—obviously useful in eg formalizing boundaries?
many other information measures from compmech with suggestive semantics—cryptic order? gauge information? synchronization order? check ruro1 and ruro2 for more.
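(to make the epsilon-reconstruction items above self-contained: the object CSSR and friends try to estimate is the causal-state partition—the standard definition, from memory:)

```latex
\overleftarrow{x} \sim_\varepsilon \overleftarrow{x}' \iff
\Pr\!\left(\overrightarrow{X} \,\middle|\, \overleftarrow{X} = \overleftarrow{x}\right)
= \Pr\!\left(\overrightarrow{X} \,\middle|\, \overleftarrow{X} = \overleftarrow{x}'\right)
```

two pasts are lumped together iff they predict the same distribution over futures; the equivalence classes are the causal states, and the epsilon machine is the induced transition structure over them.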