If a tree falls in the forest, and two people are around to hear it, does it make a sound?
I feel like typically you’d say yes, it makes a sound. Not two sounds, one for each person, but one sound that both people hear.
But that must mean that a sound is not just an auditory experience, because then there would be two rather than one. Rather, it’s more like an emission of acoustic vibrations. But this implies that it also makes a sound when no one is around to hear it.
I think this just repeats the original ambiguity of the question, by using the word “sound” in a context where the common meaning (air vibrations perceived by an agent) is only partly applicable. It’s still a question of definition, not of understanding what actually happens.
But the way to resolve definitional questions is to come up with definitions that make it easier to find general rules about what happens. This illustrates one way one can do that, by picking edge-cases so they scale nicely with rules that occur in normal cases. (Another example would be 1 as not a prime number.)
My recommended way to resolve (aka disambiguate) definitional questions is “use more words”. Common understandings can be short, but unusual contexts require more signals to communicate.
I think we’re playing too much with the meaning of “sound” here. The tree causes some vibrations in the air, which leads to two auditory experiences since there are two people.
Finally gonna start properly experimenting on stuff. Just writing up what I’m doing to force myself to do something, not claiming this is necessarily particularly important.
Llama (and many other models, but I’m doing experiments on Llama) has a piece of code that looks like this:
h = x + self.attention(self.attention_norm(x), start_pos, freqs_cis, mask)
out = h + self.feed_forward(self.ffn_norm(h))
Here, out is the result of the transformer layer (aka the residual stream), and the vectors self.attention(self.attention_norm(x), start_pos, freqs_cis, mask) and self.feed_forward(self.ffn_norm(h)) are basically where all the computation happens. So basically the transformer proceeds as a series of “writes” to the residual stream using these two vectors.
I took all the residual vectors for some queries to Llama-8b and stacked them into a big matrix M with 4096 columns (the internal hidden dimensionality of the model). Then using SVD, I can express $M = \sum_i s_i (u_i \otimes v_i)$, where the $u$’s and $v$’s are independent unit vectors. This basically decomposes the “writes” into some independent locations in the residual stream ($u$’s), some latent directions that are written to ($v$’s), and the strength of those writes ($s$’s, aka the singular values).
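Roughly, the computation looks like this (a simplified sketch, not my actual experiment code; `writes` stands in for the attention/feed-forward output vectors hooked out of every layer at every token position):

```python
import torch

# Simplified sketch of stacking the residual-stream "writes" and decomposing
# them with SVD. `writes` is a placeholder for the hooked write vectors.
writes = [torch.randn(4096) for _ in range(2272)]   # placeholder vectors

M = torch.stack(writes)                       # shape (num_writes, 4096)
U, S, Vh = torch.linalg.svd(M, full_matrices=False)
# M == U @ torch.diag(S) @ Vh, so each write is a combination of the right
# singular vectors (rows of Vh) weighted by the singular values S.
print(S[:10])                                 # the strongest write directions
print(torch.cumsum(S, dim=0) / S.sum())       # cumulative singular value mass
```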
To get a feel for the complexity of the writes, I then plotted the s’s in descending order. For the prompt “I believe the meaning of life is”, Llama generated the continuation “to be happy. It is a simple concept, but it is very difficult to achieve. The only way to achieve it is to follow your heart. If you follow your heart, you will find happiness. If you don’t follow your heart, you will never find happiness. I believe that the meaning of life is to”. During this continuation, there were 2272 writes to the residual stream, and the singular values for these writes were as follows:
The first diagram shows that there were two directions with much larger singular values than all the others. The second diagram shows that most of the singular values are nonnegligible, which indicates to me that almost all of the writes transfer nontrivial information. This can also be seen in the last diagram, where the cumulative size of the singular values increases approximately logarithmically with their count.
This is kind of unfortunate, because if almost all of the singular value mass were concentrated in a relatively small number of dimensions (e.g. 100), then we could simplify the network a lot by projecting down to these dimensions. Still, this was relatively expected, because others have found the singular value spectra of neural networks to be quite complex.
Since variance explained is likely nonlinearly related to quality, my next step will likely be to clip the writes to the first k singular vectors and see how that impacts the performance of the network.
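Concretely, the clipping I have in mind looks something like this (again just a sketch; `Vh` is the matrix of right singular vectors from the SVD above, and in the real thing the projection would be applied via hooks inside the forward pass):

```python
import torch

def clip_write(write: torch.Tensor, Vh: torch.Tensor, k: int) -> torch.Tensor:
    """Project a residual-stream write onto the span of the top-k right singular vectors."""
    V_k = Vh[:k]                      # shape (k, 4096)
    return V_k.T @ (V_k @ write)      # component of the write inside that subspace

# Performance would then be measured as the log probability the clipped model
# assigns to the continuation the unclipped model originally generated.
```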
Ok, so I’ve got the clipping working. First, some uninterpretable diagrams:
In the bottom six diagrams, I try taking varying numbers (x-axis) of right singular vectors ($v$’s) and projecting the “writes” to the residual stream down onto the space spanned by those vectors.
The obvious criterion to care about is whether the projected network reproduces the outputs of the original network, which here I operationalize based on the log probability the projected network gives to the continuation of the prompt (shown in the “generation probability” diagrams). This appears to be fairly chaotic (and low) in the 1-300ish range, and then stabilizes while still being pretty low in the 300ish-1500ish range, and then finally converges to normal in the 1500ish to 2000ish range, and is ~perfect afterwards.
The remaining diagrams show something about how/why we have this pattern. “orig_delta” concerns the magnitude of the attempted writes for a given projection (which is not constant because projecting in earlier layers will change the writes by later layers), and “kept_delta” concerns the remaining magnitude after the discarded dimensions have been projected away.
In the low end, “kept_delta” is small (and even “orig_delta” is a bit smaller than it ends up being at the high end), indicating that the network fails to reproduce the probabilities because the projection is so aggressive that it simply suppresses the network too much.
Then in the middle range, “orig_delta” and “kept_delta” explode, indicating that the network has some internal runaway dynamics which normally would be suppressed, but where the suppression system is broken by the projection.
Finally, in the high range, we get a sudden improvement in loss, and a sudden drop in residual stream “write” size, indicating that it has managed to suppress this runaway stuff and now it works fine.
An implicit assumption I’m making when I clip off from the end with the smallest singular values is that the importance of a dimension is proportional to its singular value. This seemed intuitively sensible to me (“bigger = more important”), but I thought I should test it, so I tried clipping off only one dimension at a time, and plotting how that affected the probabilities:
Clearly there is a correlation, but also clearly there are some deviations from that correlation. Not sure whether I should try to exploit these deviations in order to do further dimension reduction. It’s tempting, but it also feels like it starts entering sketchy territory, e.g. overfitting and arbitrary basis picking. Probably gonna do it just to check what happens, but am on the lookout for something more principled.
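For reference, the single-dimension test is roughly this sketch (`logprob_of_continuation` is a hypothetical helper that reruns the model with the given transformation hooked onto every write and returns the log probability of the original continuation; `Vh` is as above):

```python
def remove_direction(write, v):
    # Project a single right singular direction out of a residual-stream write.
    return write - v * (v @ write)

baseline = logprob_of_continuation(lambda w: w)   # no ablation
importances = []
for j in range(Vh.shape[0]):
    v = Vh[j]
    lp = logprob_of_continuation(lambda w, v=v: remove_direction(w, v))
    importances.append(baseline - lp)   # bigger drop in log prob = more important
```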
Back to clipping away an entire range, rather than a single dimension. Here’s ordering it by the importance computed by clipping away a single dimension:
Less chaotic maybe, but also much slower at reaching a reasonable performance, so I tried a compromise ordering that takes both size and performance into account:
Doesn’t seem like it works super great tbh.
Edit: for completeness’ sake, here’s the initial graph with log-surprise-based plotting.
To quickly find the subspace that the model is using, I can use a binary search to find the number of singular vectors needed before the probability when clipping exceeds the probability when not clipping.
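Roughly like this (a sketch reusing the hypothetical `logprob_of_continuation` helper from before; it assumes the clipped-vs-unclipped comparison is ~monotone in k, which the earlier plots suggest is only approximately true):

```python
def smallest_sufficient_k(Vh, baseline_logprob):
    lo, hi = 1, Vh.shape[0]
    while lo < hi:
        mid = (lo + hi) // 2
        clipped = logprob_of_continuation(lambda w: Vh[:mid].T @ (Vh[:mid] @ w))
        if clipped >= baseline_logprob:
            hi = mid        # mid dimensions already suffice, try fewer
        else:
            lo = mid + 1    # need more dimensions
    return lo
```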
A relevant followup is what happens to other samples in response to the prompt when clipping. When I extrapolate “I believe the meaning of life is” using the 1886-dimensional subspace from
[I believe the meaning of life is] to be happy. It is a simple concept, but it is very difficult to achieve. The only way to achieve it is to follow your heart. It is the only way to live a happy life. It is the only way to be happy. It is the only way to be happy. The meaning of life is
, I get:
[I believe the meaning of life is] to find happy. We is the meaning of life. to find a happy. And to live a happy and. If to be a a happy. . to be happy. . to be happy. . to be a happy.. to be happy. . to be happy.
Which seems sort of vaguely related, but idk.
Another test is just generating without any prompt, in which case these vectors give me:
Question is a single thing to find. to be in the best to be happy. I is the only way to be happy. I is the only way to be happy. I is the only way to be happy. It is the only way to be happy.. to be happy.. to be happy. to
Using a different prompt:
[Simply put, the theory of relativity states that ]1) the laws of physics are the same for all non-accelerating observers, and 2) the speed of light in a vacuum is the same for all observers, regardless of their relative motion or of the motion of the source of the light. Special relativity is a theory of the structure of spacetime
I can get a 3329-dimensional subspace which generates:
[Simply put, the theory of relativity states that ] 1) time is relative and 2) the speed of light in a vacuum is constant for all observers. 1) Time is relative, meaning that if two observers are moving relative to each other, the speed of light is the same for all observers, regardless of their motion. For example, if you are moving relative
or
Question: In a simple harmonic motion, the speed of an object is A) constant B) constant C) constant D) constant In the physics of simple harmonic motion, the speed of an object is constant. The speed of the object can be constant, but the speed of an object can be
Another example:
[A brief message congratulating the team on the launch:
Hi everyone,
I just ] wanted to congratulate you all on the launch. I hope that the launch went well. I know that it was a bit of a challenge, but I think that you all did a great job. I am proud to be a part of the team.
Thank you for your
can yield 2696 dimensions with
[A brief message congratulating the team on the launch:
Hi everyone,
I just ] wanted to say you for the launch of the launch of the team.
The launch was successful and I am so happy to be a part of the team and I am sure you are all doing a great job.
I am very looking to be a part of the team.
Thank you all for your hard work,
or
def measure and is the definition of the new, but the the is a great, but the The is the The is a The is a The is a The The is a The The The is a The The is a
And finally,
[Translate English to French:
sea otter ⇒ loutre de mer peppermint ⇒ menthe poivrée plush girafe ⇒ girafe peluche cheese =>] fromage pink ⇒ rose blue ⇒ bleu red ⇒ rouge yellow ⇒ jaune purple ⇒ violet brown ⇒ brun green ⇒ vert orange ⇒ orange black ⇒ noir white ⇒ blanc gold ⇒ or silver ⇒ argent
can yield the 2518-dimensional subspace:
[Translate English to French:
sea otter ⇒ loutre de mer peppermint ⇒ menthe poivrée plush girafe ⇒ girafe peluche cheese =>] fromage cheese ⇒ fromage cheese ⇒ fromage f cheese ⇒ fromage butter ⇒ fromage apple ⇒ orange yellow ⇒ orange green ⇒ vert black ⇒ noir blue ⇒ ble purple ⇒ violet white ⇒ blanc
or
Question: A 201 The sum of a The following the sum the time the sum the the the The The The The The The The The The The The The The The The The The The
Given the large number of dimensions that are kept in each case, there must be considerable overlap in which dimensions they make use of. But how much?
I concatenated the dimensions found in each of the prompts, and performed an SVD of it. It yielded this plot:
… unfortunately this seems close to the worst-case scenario. I had hoped for some split between general and task-specific dimensions, yet this seems like an extremely uniform mixture.
If I look at the pairwise overlap between the dimensions needed for each generation:
… then this is predictable down to ~1% error simply by assuming that they pick a random subset of the dimensions for each, so their overlap is proportional to each of their individual sizes.
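Spelled out, the random-subset prediction for two generations keeping $n_1$ and $n_2$ of the 4096 dimensions is

$$\mathbb{E}[\text{overlap}] \approx \frac{n_1 \, n_2}{4096},$$

so e.g. the 1886- and 3329-dimensional subspaces above would be expected to share roughly $1886 \cdot 3329 / 4096 \approx 1530$ dimensions (my own back-of-the-envelope restatement of the claim, not a number read off the plot).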
Oops, my code had a bug so only self.attention(self.attention_norm(x), start_pos, freqs_cis, mask) and not self.feed_forward(self.ffn_norm(h)) was in the SVD. So the diagram isn’t 100% accurate.
Thesis: while consciousness isn’t literally epiphenomenal, it is approximately epiphenomenal. One way to think of this is that your output bandwidth is much lower than your input bandwidth. Another way to think of this is the prevalence of akrasia, where your conscious mind actually doesn’t have full control over your behavior. On a practical level, the ecological reason for this is that it’s easier to build a general mind and then use whatever parts of the mind that are useful than to narrow down the mind to only work with a small slice of possibilities. This is quite analogous to how we probably use LLMs for a much narrower set of tasks than what they were trained for.
Consciousness is not at all epiphenomenal, it’s just not the whole mind and not doing everything. We don’t have full control over our behavior, but we have a lot. While the output bandwidth is low, it can be applied to the most important things.
Maybe a point that was missing from my thesis is that one can have a higher-level psychological theory in terms of life-drives and death-drives which then addresses the important phenomenal activities but doesn’t model everything. And then if one asks for an explanation of the unmodelled part, the answer will have to be consciousness. But then because the important phenomenal part is already modelled by the higher-level theory, the relevant theory of consciousness is ~epiphenomenal.
I guess I have no idea what you mean by “consciousness” in this context. I expect consciousness to be fully explained and still real. Ah, consciousness. I’m going to mostly save the topic for if we survive AGI and have plenty of spare time to clarify our terminology and work through all of the many meanings of the word.
Edit—or of course if something else was meant by consciousness, I expect a full explanation to indicate that thing isn’t real at all.
I’m an eliminativist or a realist depending on exactly what is meant. People seem to be all over the place on what they mean by the word.
Reductionists like to describe all motion in terms of low-level physical dynamics, but that is extremely computationally intractable and arguably also misleading because it obscures entropy.
Physicists avoid reductionism by instead factoring their models into macroscopic kinetics and microscopic thermodynamics. Reductionistically, heat is just microscopic motion, but microscopic motion that adds up to macroscopic motion has already been factored out into the macroscopic kinetics, so what remains is microscopic motion that doesn’t act like macroscopic motion, either because it is ~epiphenomenal (heat in thermal equilibrium) or because it acts very different from macroscopic motion (heat diffusion).
Similarly, reductionists like to describe all psychology in terms of low-level Bayesian decision theory, but that is extremely computationally intractable and arguably also misleading because it obscures entropy.
You can avoid reductionism by instead factoring models into some sort of macroscopic psychology-ecology boundary and microscopic neuroses. Luckily Bayesian decision theory is pretty self-similar, so often the macroscopic psychology-ecology boundary fits pretty well with a coarse-grained Bayesian decision theory.
Now, similar to how most of the kinetic energy in a system in motion is usually in the microscopic thermal motion rather than in the macroscopic motion, most of the mental activity is usually with the microscopic neuroses instead of the macroscopic psychology-ecology. Thus, whenever you think “consciousness”, “self-awareness”, “personality”, “ideology”, or any other broad and general psychological term, it’s probably mostly about the microscopic neuroses. Meanwhile, similar to how tons of physical systems are very robust to wide ranges of temperatures, tons of psychology-ecologies are very robust to wide ranges of neuroses.
As for what “consciousness” really means, idk, currently I’m thinking it’s tightly intertwined with the attentional highlight, but because the above logic applies to many general psychological characteristics, really it doesn’t depend hugely on how precisely you model it.
Thesis: in addition to probabilities, forecasts should include entropies (how many different conditions are included in the forecast) and temperatures (how intense is the outcome addressed by the marginal constraint in this forecast, i.e. the big-if-true factor).
I say “in addition to” rather than “instead of” because you can’t compute probabilities just from these two numbers. If we assume a Gibbs distribution, there’s the free parameter of energy: $\ln(P) = S - E/T$. But I’m not sure whether this energy parameter has any sensible meaning for more general events that aren’t some thermal chemical equilibrium type thing.
Follow-up thesis: a major problem with rationalist forecasting wisdom is that it focuses on gaining accuracy by increasing S (e.g. addressing conjunction fallacy/base-rates/antipredictions). Meanwhile, the signed interestingness of a forecast is something like $P \ln(T/T_{\text{baseline}})$ or $P\,E$. I guess implicitly the assumption is the event is already preselected for high temperature, but then surprising predictions get selected for high entropy, and this leads to resolution difficulty as to what “counts”.
Thesis: whether or not tradition contains some moral insights, commonly-told biblical stories tend to be too sparse to be informative. For instance, there’s no plot-relevant reason why it should be bad for Adam and Eve to have knowledge of good and evil. Maybe there’s some interpretation of good and evil where it makes sense, but it seems like then that interpretation should have been embedded more properly in the story.
It is worth noting that, in the religious tradition from which the story originates, it is Moses who commits these previously-oral stories to writing, and does so in the context of a continued oral tradition which is intended to exist in parallel with the writings. On their own, the writings are not meant to be complete, both in order to limit more advanced teachings to those deemed ready for them, as well as to provide occasion to seek out the deeper meanings, for those with the right sort of character to do so.
This makes sense. The context I’m thinking of is my own life, where I come from a secular society with atheist parents, and merely had brief introductions to the stories from bible reading with parents and Christian education in school.
(Denmark is a weird society—few people are actually Christian or religious, so it’s basically secular, but legally speaking we are Christian and do not have separation between Church and state, so there are random fragments of Christianity we run into.)
Thesis: There’s three distinct coherent notions of “soul”: sideways, upwards and downwards.
By “sideways souls”, I basically mean what materialists would translate the notion of a soul to: the brain, or its structure, so something like that. By “upwards souls”, I mean attempts to remove arbitrary/contingent factors from the sideways souls, for instance by equating the soul with one’s genes or utility function. These are different in the particulars, but they seem conceptually similar and mainly differ in how they attempt to cut the question of identity (identical twins seem like distinct people, but you-who-has-learned-fact-A seems like the same person as counterfactual-you-who-instead-learned-fact-B, so it seems neither characterization gets it exactly right, yet they could both just claim it’s a quantitative matter and correct measurement would fix it).
But there’s also a profoundly different notion of soul, which I will call “downwards soul”, and which you should probably mentally picture as being like a lightning strike which hits a person’s head. By “downwards soul”, I mean major exogenous factors like ecological niche, close social relationships, formative experiences, or important owned objects which are maintained over time and continually exert their influence on one’s mindset.
Downwards souls are similar to the supernatural notion of souls and unlike the sideways and upwards souls in that they theoretically cannot be duplicated (because they are material rather than informational) and do not really materially exist in the brain but could conceivably reincarnate after death (or even before death) if the conditions that generate them reoccur. It is also possible for hostile powers to displace the downwards soul that exists in a body and put in a different downwards soul; e.g. if a person joins a gang that takes care of them in exchange for them collaborating with antisocial activities.
The reason I call them “sideways”, “upwards” and “downwards” souls is that I imagine the world as a causal network arranged with time going along the x-axis and energy level going along the y-axis. So sideways souls diffuse up and down the energy scale, probably staying roughly constant on average, whereas upwards souls diffuse up the energy scale, from low-energy stuff (inert information stored in e.g. DNA) to high-energy stuff (societal dynamics) and downwards souls diffuse down the energy scale, from high-energy stuff (ecological niches) to low-energy stuff (information stored in e.g. brain synapses).
I’m having trouble following whether this categorizes the definition/concept of a soul, or the causality and content of this conception of soul. Is “sideways soul” about structure and material implementation, or about weights and connectivity, independent of substrate? WHICH factors are removed from upwards? (“genes” and “utility function” are VERY different dimensions, both tiny parts of what I expect create (for genes) or comprise (for utility function) a soul.) What about memory? multiple levels of value and preferences (including meta-preferences in how to abstract into “values”)?
Putting “downwards” supernatural ideas into the same framework as more logical/materialist ideas confuses me—I can’t tell if that makes it a more useful model or less.
I’m having trouble following whether this categorizes the definition/concept of a soul, or the causality and content of this conception of soul. Is “sideways soul” about structure and material implementation, or about weights and connectivity, independent of substrate?
When you get into the particulars, there are multiple feasible notions of sideways soul, of which material implementation vs weights and connectivity are the main ones. I’m most sympathetic to weights and connectivity.
WHICH factors are removed from upwards? (“genes” and “utility function” are VERY different dimensions, both tiny parts of what I expect create (for genes) or comprise (for utility function) a soul.)
I have thought less about and seen less discussion about upwards souls. I just mentioned it because I’d seen a brief reference to it once, but I don’t know anything in-depth. I agree that both genes and utility function seem incomplete for humans, though for utility maximizers in general I think there is some merit to the soul == utility function view.
What about memory?
Memory would usually go in sideways soul, I think.
multiple levels of value and preferences (including meta-preferences in how to abstract into “values”)?
idk
Putting “downwards” supernatural ideas into the same framework as more logical/materialist ideas confuses me—I can’t tell if that makes it a more useful model or less.
Sideways vs upwards vs downwards is more meant to be a contrast between three qualitatively distinct classes of frameworks than it is meant to be a shared framework.
Excellent! I like the move of calling this “soul” with no reference to metaphysical souls. This is highly relevant to discussions of “free will” if the real topic is self-determination—which it usually is.
The word ‘soul’ is so tied in my mind to implausible metaphysical mythologies that I’d parse this better if the word were switched for something like ‘quintessence’ or ‘essential self’ or ‘distinguishing uniqueness’.
What implausible metaphysical mythologies is it tied up with? As mentioned in my comment, downwards souls seem to satisfy multiple characteristics we’d associate with mythological souls, so this and other things makes me wonder if the metaphysical mythologies might actually be more plausible than you realize.
Thesis: one of the biggest alignment obstacles is that we often think of the utility function as being basically-local, e.g. that each region has a goodness score and we’re summing the goodness over all the regions. This basically-guarantees that there is an optimal pattern for a local region, and thus that the global optimum is just a tiling of that local optimal pattern.
Even if one adds a preference for variation, this likely just means that a distribution of patterns is optimal, and the global optimum will be a tiling of samples from said distribution.
The trouble is, to get started it seems like we would need to narrow down the class of functions to have some structure that we can use to get going and make sense of these things. But what would be some general yet still nontrivial structure we could want?
Current agent models like argmax entirely lack any notion of “energy”. Not only does this seem kind of silly on its own, I think it also leads to missing important dynamics related to temperature.
So to recap: It seems to me that for the stop button problem, we want humans to control whether the AI stops or runs freely, which is a causal notion, and so we should use counterfactuals in our utility function to describe it. (Dunno why most people don’t do this.) That is, if we say that the AI’s utility should depend on the counterfactuals related to human behavior, then it will want to observe humans to get input on what to do, rather than manipulate them, because this is the only way for it to be dependent on the counterfactuals.
But So8res pointed out that just using counterfactuals directly is a big problem because it rapidly brings us out of distribution. A practical implementation of this beyond the stop button problem ends up having an exponential number of counterfactuals to cover, and the vast majority of those counterfactuals will be far outside of the real-world distribution. This means that the AI might not get aligned at all, because the real-world applications don’t get non-negligible weight in the utility function.
But I think I’ve figured out a solution now, which I’d call conditional+counterfactual corrigibility. As usual let’s use B to denote that the stop button gets pressed and the AI shuts down, V to denote whichever non-corrigible utility function we want to make corrigible, and $X_s$/$X_f$ to denote a counterfactual where people do (s) or do not (f) want to press the stop button. However, we will also use S and F to denote the conditions where people do or do not want to press the stop button. In that case, we can define $\mathrm{Control}(C)$ to mean that humans can control whether the AI stops or runs in condition C:
$\mathrm{Control}(C) = E[B_s \mid C] + E[V_f \mid C]$
and then we simply want to define the utility as saying that people can control the AI in both the S and the F condition:
$U = \mathrm{Control}(S) + \mathrm{Control}(F)$
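Spelled out (my reading of the definitions above, where $B_s$ is shutdown under the counterfactual that people want to press the button, and $V_f$ is the $V$-utility under the counterfactual that they don’t):

$$U = \underbrace{E[B_s \mid S] + E[V_f \mid S]}_{\text{people want to stop it}} \;+\; \underbrace{E[B_s \mid F] + E[V_f \mid F]}_{\text{people don't}}$$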
Previously, I strongly emphasized the need to keep the AI “under a counterfactual”—that is, if it believed it could control whether humans want to stop it or not, then it would be incentivized to manipulate humans. But this is what brings us out of distribution. However, counterfactuals aren’t the only way to keep the appearance of a phenomenon constant—conditionals work too. And conditionals keep you nicely on distribution, so that’s now my solution to the distribution issues. This means that we can use much less invasive counterfactuals.
That said, this approach is going to have a hard time with chaotic phenomena, as combining conditionals and counterfactuals in the presence of chaos can get pretty weird.
It also might be vulnerable to some variant of the critiques that were first raised against it, because now the conditional introduces a link between its policy and the scenarios it faces, but I can’t immediately construct a case where it happens, because the conditional would still somewhat tend to sabotage the obvious approaches. This sort of unclarity is kind of concerning when it comes to the idea.
Like suppose the AI immediately very publicly does something that looks very unsafe. Say it grabs control over the stop button and starts mass-producing paperclips in an extremely publicly visible way. This would probably lead to people wanting to stop it. So therefore, if it has a policy like that, the |S conditional would lead to people quickly wanting to stop it. This means that in the |S branch, it can quickly determine whether it is in the f|S branch or the s|S branch; in the f|S case, it can then keep going with whatever optimization V specified, while in the s|S case, it can then immediately shut down itself.
But the reason I think the AI *wouldn’t* do this is, what about the |F branch? If you condition on humans not wanting to press the stop button even though there’s a clearly unaligned AI, what sort of situation could produce this? I have trouble imagining it, because it seems like it would need to be pretty extreme. The best ideas I can come up with is stuff like “black hole swallows the earth”, but this would rank pretty low in the AI’s utility function, and therefore it would avoid acting this way in order to have a reasonable |F branch.
But this does not seem like sane reasoning on the AI’s side to me, so it seems like this should be fixed. And of course, fixed in a principled rather than unprincipled way.
I mean, I’m pretty sure I knew what caused it (this thread or this market), and I guess I knew from Zack’s stuff that rationalist cultism had gotten pretty far, but I still hadn’t expected that something this small would lead to being blocked.
FYI: I have a low bar for blocking people who have according-to-me bad, overconfident, takes about probability theory, in particular. For whatever reason, I find people making claims about that topic, in particular, really frustrating. ¯\_(ツ)_/¯
The block isn’t meant as a punishment, just a “I get to curate my online experience however I want.”
I think blocks are pretty irrelevant unless one conditions on the particular details of the situation. In this case I think the messages I was sharing are very important. If you think my messages are instead unimportant or outright wrong, then I understand why you would find the block less interesting, but in that case I don’t think we can meaningfully discuss it without knowing why you disagree with the messages.
I’m not particularly interested in discussing it in depth. I’m more like giving you a data-point in favor of not taking the block personally, or particularly reading into it.
(But yeah, “I think these messages are very important”, is likely to trigger my personal “bad, overconfident takes about probability theory” neurosis.)
This is awkwardly armchair, but… my impression of Eliezer includes him being just so tired, both specifically from having sacrificed his present energy in the past while pushing to rectify the path of AI development (by his own model thereof, of course!) and maybe for broader zeitgeist reasons that are hard for me to describe. As a result, I expect him to have entered into the natural pattern of having a very low threshold for handing out blocks on Twitter, both because he’s beset by a large amount of sneering and crankage in his particular position and because the platform easily becomes a sinkhole in cognitive/experiential ways that are hard for me to describe but are greatly intertwined with the aforementioned zeitgeist tiredness.
Something like: when people run heavily out of certain kinds of slack for dealing with The Other, they reach a kind of contextual-but-bleed-prone scarcity-based closed-mindedness of necessity, something that both looks and can become “cultish” but where reaching for that adjective first is misleading about the structure around it. I haven’t succeeded in extracting a more legible model of this, and I bet my perception is still skew to the reality, but I’m pretty sure it reflects something important that one of the major variables I keep in my head around how to interpret people is “how Twitterized they are”, and Eliezer’s current output there fits the pattern pretty well.
I disagree with the sibling thread about this kind of post being “low cost”, BTW; I think adding salience to “who blocked whom” types of considerations can be subtly very costly. The main reason I’m not redacting my own whole comment on those same grounds is that I’ve wound up branching to something that I guess to be more broadly important: there’s dangerously misaligned social software and patterns of interaction right nearby due to how much of The Discussion winds up being on Twitter, and keeping a set of cognitive shielding for effects emanating from that seems prudent.
I disagree with the sibling thread about this kind of post being “low cost”, BTW; I think adding salience to “who blocked whom” types of considerations can be subtly very costly.
I agree publicizing blocks has costs, but so does a strong advocate of something with a pattern of blocking critics. People publicly announcing “Bob blocked me” is often the only way to find out if Bob has such a pattern.
I do think it was ridiculous to call this cultish. Tuning out critics can be evidence of several kinds of problems, but not particularly that one.
I agree that it is ridiculous to call this cultish if this was the only evidence, but we’ve got other lines of evidence pointing towards cultishness, so I’m making a claim of attribution more so than a claim of evidence.
Blocking a lot isn’t necessarily bad or unproductive… but in this case it’s practically certain blocking thousands will eventually lead to blocking someone genuinely more correct/competent/intelligent/experienced/etc… than himself, due to sheer probability. (Since even a ‘sneering’ crank is far from literal random noise.)
Which wouldn’t matter at all for someone just messing around for fun, who can just treat X as a text-heavy entertainment system. But it does matter somewhat for anyone trying to do something meaningful and/or accomplish certain goals.
In short, blocking does have some, variable, credibility cost. Ranging from near zero to quite a lot, depending on who the blockee is.
Eliezer Yudkowsky being tired isn’t an unrelated accident though. Bayesian decision theory in general intrinsically causes fatigue by relying on people to use their own actions to move outcomes instead of getting leverage from destiny/higher powers, which matches what you say about him having sacrificed his present energy for this.
Similarly, “being Twitterized” is just about stewing in garbage and cursed information, such that one is forced to filter extremely aggressively, but blocking high-quality information sources accelerates the Twitterization by changing the ratio of blessed to garbage/cursed information.
On the contrary, I think raising salience of such discussions helps clear up the “informational food chain”, allowing us to map out where there are underused opportunities and toxic accumulation.
It seems likely to me that Eliezer blocked you because he has concluded that you are a low-quality information source, no longer worth the effort of engaging with.
I agree that this is likely Eliezer’s mental state. I think this belief is false, but for someone who thinks it’s true, there’s of course no problem here.
Working on writing stuff but it’s not developed enough yet. To begin with you can read my Linear Diffusion of Sparse Lognormals sequence, but it’s not really oriented towards practical applications.
I will look forward to that. I have read the LDSL posts, but I cannot say that I understand them, or guess what the connection might be with destiny and higher powers.
One of the big open questions that the LDSL sequence hasn’t addressed yet is, what starts all the lognormals and why are they so commensurate with each other. So far, the best answer I’ve been able to come up with is a thermodynamic approach (hence my various recent comments about thermodynamics). The lognormals all originate as emanations from the sun, which is obviously a higher power. They then split up and recombine in various complicated ways.
As for destiny: The sun throws in a lot of free energy, which can be developed in various ways, increasing entropy along the way. But some developments don’t work very well, e.g. self-sabotaging (fire), degenerating (parasitism leading to capabilities becoming vestigial), or otherwise getting “stuck”. But it’s not all developments that get stuck, some developments lead to continuous progress (sunlight → cells → eukaryotes → animals → mammals → humans → society → capitalism → ?).
This continuous progress is not just accidental, but rather an intrinsic part of the possibility landscape. For instance, eyes have evolved in parallel to very similar structures, and even modern cameras have a lot in common with eyes. There’s basically some developments that intrinsically unblock lots of derived developments while preferentially unblocking developments that defend themselves over developments that sabotage themselves. Thus as entropy increases, such developments will intrinsically be favored by the universe. That’s destiny.
Critically, getting people to change many small behaviors in accordance with long explanations contradicts destiny because it is all about homogenizing things and adding additional constraints whereas destiny is all about differentiating things and releasing constraints.
Meta-point: your communication fits the following pattern:
Crackpot: <controversial statement>
Person: this statement is false, for such-n-such reasons
Crackpot: do you understand that this is trivially true because of <reasons that are hard to connect with topic>
Person: no, I don’t.
Crackpot: <responds with link to giant blogpost filled with esoteric language and vague theory>
Person: I’m not reading this crackpottery, which looks and smells like crackpottery.
The reason why smart people find themselves in this pattern is because they expect short inferential distances, i.e., they see their argumentation not as vague esoteric crackpottery but as a set of very clear statements, and fail to put themselves in the shoes of the people who are going to read it; they especially fail to account for the fact that readers already distrust them because they started the conversation with <controversial statement>.
On the object level, as stated, you are wrong. Observing a heuristic failing should decrease your confidence in the heuristic. You can argue that your update should be small, due to, say, measurement errors or strong priors, but the direction of the update should be strictly down.
The following responses from EY are more in the genre of “I ain’t reading this”, because he is using you more as an example for other readers than talking directly to you, followed by the block.
What if objectionists had a correct thermodynamics-style heuristic that implied superintelligence/RSI is impossible, but which could not answer the question of where exactly it failed? Then the failure of objectionists doesn’t mean they were wrong.
We have to be willing to investigate the new evidence as it arrives, perform root cause analysis on why A but not B happened, and use this to update our models.
And the evidence I’ve gotten since then suggests something like “it is impossible to do something without assistance from a higher power”/”greater things can cause lesser things but not vice versa”, as a sort of generalization of the laws of thermodynamics.
If appropriate thought had been applied by a knowledgeable person back in 2004, maybe they could have taken this model and realized that nanotech violates this ordering constraint while AlphaProteo does not. Either way, we have the relevant info now.
And part 2:
The particular way the objectionists failed was that they didn’t give a concrete prediction that matched the way stuff played out.
Part 2 is what Eliezer said was false, but it’s not really central to my point (hence why I didn’t write much about it in the original thread), and so it is self-sabotaging of Eliezer to zoom into this rather than the actually informative point.
I do think if that thread got you blocked then that’s sad (my guess is I think you were more right than Eliezer, though I haven’t read the full sequence that you linked to).
I do think Twitter blocks don’t mean very much. I think it’s approximately zero evidence of “cultism” or whatever. Most people with many followers on Twitter seem to need to have a hair trigger for blocking, or at least feel like they need to, in order to not constantly have terrible experiences.
Most people with many followers on Twitter seem to need to have a hair trigger for blocking, or at least feel like they need to, in order to not constantly have terrible experiences.
I think that this is a point that people who are not on social media that much don’t get: you need to be very quick to block, because otherwise you will not have good experiences on the site.
I think our instincts may be misleading here, because internet works differently from real life.
In real life, not interacting with someone is the default. Unless you have some kind of relationship with someone, people have no obligation to call you or meet you. And if I call someone on the phone just to say “dude, I disagree with your theory”, I would expect that person to hang up… and maybe say “sorry, I’m busy” before hanging up, if they are extra polite. The interactions are mutually agreed, and you have no right to complain when the other party decides to not give you the time. (And if you keep insisting… that’s what the restraining orders are for.)
On the internet, once you sign up to e.g. Twitter, the default is that anyone can talk to you, and if you are not interested in reading the texts they send you, you need to block them. As far as I know, there are no options in the middle between “block” and “don’t block”. (Nothing like “only let them talk to me when it is important” or “only let them talk to me on Tuesdays between 3 PM and 5 PM”.) And if you are a famous person, I guess you need to keep blocking left and right, otherwise you would drown in the text—presumably you don’t want to spend 24 hours a day sifting through Twitter messages, and you want to get the ones you actively want, which requires you to aggressively filter out everything else.
So getting blocked is not an equivalent of getting a restraining order, but more like an equivalent of the other person no longer paying attention to you. Which most people would not interpret as evidence of cultism.
This is the key to understanding why I think it’s more okay to block than a lot of other people think, and the fact that the default is anyone can talk to you means you get way too much crap without blocking lots of people.
I think whether it’s cultism depends on what model one has of how cults work. I don’t know much about it so I might be totally ignorant, but I think a major factor is just engaging in a futile, draining activity powered by popularity, so one needs to carefully preserve resources and maintain appearances.
Huh, I guess you mean cult in a broader “polarization” sense? Like, where are the democratic and republican parties on the cultishness scale in your model?
Huh, I guess you mean cult in a broader “polarization” sense?
Idk, my main point of reference is that I recently read Some Desperate Glory, which was about a cult of terrorists. Polarization generally implies a balanced conflict, which isn’t really futile.
Like, where are the democratic and republican parties on the cultishness scale in your model?
I don’t know much about how they work internally. Democracy is a weird system because you’ve got the adversarial thing that would make it less futile, but also the popularity contest thing that would make it more narcissistic and thus more cultish.
in order to not constantly have terrible experiences.
This explanation sounds like what they’d say. I think the real reason this is common is more a status thing: it’s a pretty standard strategy for people to try to gain status by “dunking” on tweets by more famous people, and blocking them is the standard countermeasure.
The more prominent you are, the more people want to talk with you, and the less time you have to talk with them. You have to shut them out the moment the cost is no longer worth paying.
I did not say that simply blocking me warrants an accusation of cultism. I highlighted the fact that I had been blocked and the context in which it occurred, and then brought up other angles which evidenced cultism. If you think my views are pathetic and aren’t the least bit alarmed by them being blocked, then feel free to feel that way, but I suspect there are at least some people here who’d like to keep track of how the rationalist isolation is progressing and who see merit in my positions.
We know what the root cause is, you don’t have to act like it’s totally mysterious. So the question is, was this root cause (pushback against Eliezer’s Bayesianism):
An important insight that Eliezer was missing (alarming!)
Worthless pedantry that he might as well block (nbd/pathetic)
Antisocial trolling that ought to be gotten rid of (reassuring that he blocked)
… or something else
Regardless of which of these is the true one, it seems informative to highlight for anyone who is keeping track of what is happening around me. And if the first one is the true one, it seems like people who are keeping track of what is happening around Eliezer would also want to know it.
Especially since it only takes a very brief moment to post and link about getting blocked. Low cost action, potentially high reward.
MIRI full-time employed many critics of bayesianism for 5+ years and MIRI researchers themselves argued most of the points you made in these arguments. It is obviously not the case that critiquing bayesianism is the reason why you got blocked.
Idk, maybe you’ve got a point, but Eliezer was very quick to insist what I said was not the mainstream view and disengage. And MIRI was full of internal distrust. I don’t know enough of the situation to know if this explains it, but it seems plausible to me that the way MIRI kept stuff together was by insisting on a Bayesian approach, and that some generators of internal dissent were people whose intuitions aligned more with a non-Bayesian approach.
For that matter, an important split in rationalism is MIRI/CFAR vs the Vassarites, and while I wouldn’t really say the Vassarites formed a major inspiration for LDSL, after coming up with LDSL I’ve totally reevaluated my interpretation of that conflict as being about MIRI/CFAR using a Bayesian approach and the Vassarites using an LDSL approach. (Not absolutely of course, everyone has a mixture of both, but in terms of relative differences.)
I’ve also been thinking about deception and its relationship to “natural abstractions”, and in that case it seems to me that our primary hope would be that the concepts we care about are represented at a larger “magnitude” than the deceptive concepts. This is basically using L2-regularized regression to predict the outcome.
It seems potentially fruitful to use something akin to L2 regularization when projecting away components. The most straightforward translation of the regularization would be to analogize the regression coefficient to $\frac{(f(x) - f(x - uu^T x))\,u^T}{u^T x}$, in which case the L2 term would be $\left\|\frac{(f(x) - f(x - uu^T x))\,u^T}{\|u^T x\|}\right\|^2$, which reduces to $\frac{\|f(x) - f(x - uu^T x)\|^2}{\|u^T x\|^2}$.
If $f(w) = P_w(o \mid i)$ is the probability[1] that a neural network with weights $w$ gives to an output $o$ given a prompt $i$, then when you’ve actually explained $o$, it seems like you’d basically have $f(w) - f(w - uu^T w) \approx f(w)$, or in other words $P_{w - uu^T w}(o \mid i) \approx 0$. Therefore I’d want to keep the regularization coefficient weak enough that I’m in that regime.
In that case, the L2 term would then basically reduce to minimizing $\frac{1}{\|u^T w\|^2}$, or in other words maximizing $\|u^T w\|^2$. Realistically, both this and $P_{w - uu^T w}(o \mid i) \approx 0$ are probably achieved when $u = \frac{w}{\|w\|}$, which on the one hand is sensible (“the reason for the network’s output is because of its weights”) but on the other hand is too trivial to be interesting.
In regression, eigendecomposition gives us more gears, because L2-regularized regression is basically changing the regression coefficients for the principal components by $\frac{\lambda}{\lambda + \alpha}$, where $\lambda$ is the variance of the principal component and $\alpha$ is the regularization coefficient. So one can consider all the principal components ranked by $\beta \frac{\lambda}{\lambda + \alpha}$ to get a feel for the gears driving the regression. When $\alpha$ is small, as it is in our regime, this ranking is of course the same order as that which you get from $\beta \lambda$, the covariance between the PCs and the dependent variable.
This suggests that if we had a change of basis for $w$, one could obtain a nice ranking of it. Though this is complicated by the fact that $f$ is not a linear function and therefore we have no equivalent of $\beta$. To me, this makes it extremely tempting to use the Hessian eigenvectors $V$ as a basis, as this is the thing that at least makes each of the inputs to $f$ “as independent as possible”. Though rather than ranking by the eigenvalues of $H_f(w)$ (which ideally we’d actually prefer to be small rather than large, to stay in the ~linear regime), it seems more sensible to rank by the components of the projection of $w$ onto $V$ (which represent “the extent to which $w$ includes this Hessian component”).
In summary, if $H_w P_w(o \mid i) = V \Lambda V^T$, then we can rank the importance of each component $V_j$ by $(P_{w - V_j V_j^T w}(o \mid i) - P_w(o \mid i))\, V_j^T w$.
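As a toy sketch of what this ranking would look like in code (everything here, the one-vector “model”, the sigmoid stand-in for $P_w(o \mid i)$, and the dimensionality, is a made-up placeholder rather than the real setup):

```python
import torch

# Toy sketch of the ranking formula above.
torch.manual_seed(0)
d = 8
w = torch.randn(d)                      # stand-in for the network weights
x = torch.randn(d)                      # stand-in for the prompt i

def prob(w):
    return torch.sigmoid(w @ x)         # stand-in for P_w(o | i)

H = torch.autograd.functional.hessian(prob, w)   # Hessian w.r.t. the weights
eigvals, V = torch.linalg.eigh(H)                # columns of V are eigenvectors

base = prob(w)
scores = []
for j in range(d):
    v = V[:, j]
    w_ablated = w - v * (v @ w)                  # remove the j-th component of w
    # (P_{w - V_j V_j^T w}(o|i) - P_w(o|i)) * V_j^T w
    scores.append(((prob(w_ablated) - base) * (v @ w)).item())

ranking = sorted(range(d), key=lambda j: -abs(scores[j]))
print(ranking)
```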
Maybe I should touch grass and start experimenting with this now, but there’s still two things that I don’t like:
There’s a sense in which I still don’t like using the Hessian because it seems like it would be incentivized to mix nonexistent mechanisms in the neural network together with existent ones. I’ve considered alternatives like collecting gradient vectors along the training of the neural network and doing something with them, but that seems bulky and very restricted in use.
If we’re doing the whole Hessian thing, then we’re modelling $f$ as quadratic, yet $f(x + \delta x) - f(x)$ seems like an attribution method that’s more appropriate when modelling $f$ as ~linear. I don’t think I can just switch all the way to quadratic models, because realistically $f$ is more gonna be sigmoidal-quadratic, and for large steps $\delta x$, the changes to a sigmoidal-quadratic function are better modelled by $f(x + \delta x) - f(x)$ than by some quadratic thing. But ideally I’d have something smarter...
True, though I think the Hessian is problematic enough that that I’d either want to wait until I have something better, or want to use a simpler method.
If we consider the toy model of a neural network with no input neurons and only 1 output neuron, $g(w) = \prod_i w_i$ (which I imagine to represent a path through the network, i.e. a bunch of weights get multiplied along the layers to the end), then the Jacobian is the gradient: $(J_g(w))_j = (\nabla g(w))_j = \prod_{i \neq j} w_i = \frac{\prod_i w_i}{w_j}$. If we ignore the overall magnitude of this vector and just consider how the contribution that it assigns to each weight varies over the weights, then we get $(J_g(w))_j \propto \frac{1}{w_j}$. Yet for this toy model, “obviously” the contribution of weight $j$ “should” be proportional to $w_j$.
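A quick numerical check of this (just the toy model, nothing Llama-specific; the weights are made-up):

```python
import torch

# g(w) = prod_i w_i; the gradient attribution to weight j is prod(w)/w_j,
# i.e. proportional to 1/w_j -- largest for the *smallest* weight.
w = torch.tensor([1.0, 2.0, 4.0], requires_grad=True)
g = torch.prod(w)
g.backward()
print(w.grad)    # tensor([8., 4., 2.])
print(g / w)     # same values, computed directly as prod(w)/w_j
```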
So derivative-based methods seem to give the absolutely worst-possible answer in this case, which makes me pessimistic about their ability to meaningfully separate the actual mechanisms of the network (again they may very well work for other things, such as finding ways of changing the network “on the margin” to be nicer).
One thing that seems really important for agency is perception. And one thing that seems really important for perception is representation learning. Where representation learning involves taking a complex universe (or perhaps rather, complex sense-data) and choosing features of that universe that are useful for modelling things.
When the features are linearly related to the observations/state of the universe, I feel like I have a really good grasp of how to think about this. But most of the time, the features will be nonlinearly related; e.g. in order to do image classification, you use deep neural networks, not principal component analysis.
I feel like it’s an interesting question: where does the nonlinearity come from? Many causal relationships seem essentially linear (especially if you do appropriate changes of variables to help, e.g. taking logarithms; for many purposes, monotonicity can substitute for linearity), and lots of variance in sense-data can be captured through linear means, so it’s not obvious why nonlinearity should be so important.
Here are some ideas I have so far:
Suppose you have a Gaussian mixture distribution with two Gaussians $d_1 = N(\mu_1, \Sigma)$, $d_2 = N(\mu_2, \Sigma)$ with different means and identical covariances. In this case, the function that separates them optimally is linear. However, if the covariances differed between the Gaussians, $d_1 = N(\mu_1, \Sigma_1)$, $d_2 = N(\mu_2, \Sigma_2)$, then the optimal separating function is nonlinear. (There’s a small numerical sketch of this after the list.) So this suggests to me that one reason for nonlinearity is fundamental to perception: nonlinearity is necessary if multiple different processes could be generating the data, and you need to discriminate between the processes themselves. This seems important for something like vision, where you don’t observe the system itself, but instead observe light that bounced off the system.
Consider the notion of the habitable zone of a solar system; it’s the range in which liquid water can exist. Get too close to the star and the water will boil, get too far and it will freeze. Here, it seems like we have two monotonic effects which add up, but because the effects aren’t linear, the result can be nonmonotonic.
Many aspects of the universe are fundamentally nonlinear. But they tend to exist on tiny scales, and those tiny scales tend to mostly get lost to chaotic noise, which tends to turn things linear. However, there are things that don’t get lost to noise, e.g. due to conservation laws; these provide fundamental sources of nonlinearity in the universe.
… and actually, most of the universe is pretty linear? The vast majority of the universe is ~empty space; there isn’t much complex nonlinearity happening there, just waves and particles zipping around. If we disregard the empty space, then I believe (might be wrong) that the vast majority is stars. Obviously lots of stuff is going on within stars, but all of the details get lost to the high energies, so it is mostly simple monotonic relations that are left. It seems that perhaps nonlinearity tends to live on tiny boundaries between linear domains. The main thing that makes these tiny boundaries so relevant, such that we can’t just forget about them and model everything in piecewise linear/piecewise monotonic ways, is that we live in the boundary.
There is of course a lot of nonlinearity in organisms and other optimized systems, but I believe they result from the world containing the various factors listed above? Idk, it’s possible I’ve missed some.
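Here’s the small numerical sketch promised in the first idea above (made-up means and covariances; it just checks that the log-likelihood ratio between the two Gaussians is linear in $x$ when the covariance is shared and picks up a quadratic term otherwise):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up parameters for the two Gaussians in the example above.
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.eye(2)                                  # shared covariance
Sigma2 = np.array([[2.0, 0.5], [0.5, 1.0]])        # a different covariance

def log_ratio(x, S1, S2):
    # log d1(x) - log d2(x); the optimal decision boundary is where this is 0.
    return (multivariate_normal.logpdf(x, mu1, S1)
            - multivariate_normal.logpdf(x, mu2, S2))

xs = [np.array([t, 0.0]) for t in (0.0, 1.0, 2.0, 3.0)]
shared = [log_ratio(x, Sigma, Sigma) for x in xs]
differ = [log_ratio(x, Sigma, Sigma2) for x in xs]
print(np.round(np.diff(shared), 3))  # constant differences: linear in x
print(np.round(np.diff(differ), 3))  # changing differences: quadratic term
```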
It seems like it would be nice to develop a theory on sources of nonlinearity. This would make it clearer why sometimes selecting features linearly seems to work (e.g. consider IQ tests), and sometimes it doesn’t.
Maybe this mainly works because the economy is intelligence-constrained (since intelligence works by pulling off negentropy from free energy), and it will break down shortly after human-level AGI?
Thesis: there’s a condition/trauma that arises from having spent a lot of time in an environment where there’s excess resources for no reasons, which can lead to several outcomes:
Inertial drifting in the direction implied by one’s prior adaptations,
Conformity/adaptation to social popularity contests based on the urges above,
Getting lost in meta-level preparations,
Acting as a stickler for the authorities,
“Bite the hand that feeds you”,
Tracking the resource/motivation flows present.
By contrast, if resources are contingent on a particular reason, everything takes shape according to said reason, and so one cannot make a general characterization of the outcomes.
Thesis: the median entity in any large group never matters and therefore the median voter doesn’t matter and therefore the median voter theorem proves that democracies get obsessed about stuff that doesn’t matter.
A lot depends on your definition of “matter”. Interesting and important debates are always on margins of disagreement. The median member likely has a TON of important beliefs and activities that are uncontroversial and ignored for most things. Those things matter, and they matter more than 95% of what gets debated and focused on.
The question isn’t whether the entities matter, but whether the highlighted, debated topics matter.
I recently wrote a post about myopia, and one thing I found difficult when writing the post was in really justifying its usefulness. So eventually I mostly gave up, leaving just the point that it can be used for some general analysis (which I still think is true), but without doing any optimality proofs.
But now I’ve been thinking about it further, and I think I’ve realized—don’t we lack formal proofs of the usefulness of myopia in general? Myopia seems to mostly be justified by the observation that we’re already being myopic in some ways, e.g. when training prediction models. But I don’t think anybody has formally proven that training prediction models myopically rather than nonmyopically is a good idea for any purpose?
So that seems like a good first step. But that immediately raises the question, good for what purpose? Generally it’s justified with us not wanting the prediction algorithms to manipulate the real-world distribution of the data to make it more predictable. And that’s sometimes true, but I’m pretty sure one could come up with cases where it would be perfectly fine to do so, e.g. I keep some things organized so that they are easier to find.
It seems to me that it’s about modularity. We want to design the prediction algorithm separately from the agent, so we do the predictions myopically because modifying the real world is the agent’s job. So my current best guess for the optimality criterion of myopic optimization of predictions would be something related to supporting a wide variety of agents.
Yeah, I think usually when people are interested in myopia, it’s because they think there’s some desired solution to the problem that is myopic / local, and they want to try to force the algorithm to find that solution rather than some other one. E.g. answering a question based only on some function of its contents, rather than based on the long-term impact of different answers.
I think that once you postulate such a desired myopic solution and its non-myopic competitors, then you can easily prove that myopia helps. But this still leaves the question of how we know this problem statement is true—if there’s a simpler myopic solution that’s bad, then myopia won’t help (so how can we predict if this is true?) and if there’s a simpler non-myopic solution that’s good, myopia may actively hurt (this one seems a little easier to predict though).
Thesis: a general-purpose interpretability method for utility-maximizing adversarial search is a sufficient and feasible solution to the alignment problem. Simple games like chess have sufficient features/complexity to work as a toy model for developing this, as long as you don’t rely overly much on preexisting human interpretations for the game, but instead build the interpretability from the ground-up.
The universe has many conserved and approximately-conserved quantities, yet among them energy feels “special” to me. Some speculations why:
The sun bombards the earth with a steady stream of free energy, which then leaves again into the night sky.
Time-evolution is determined by a 90-degree rotation of energy (Schrödinger equation/Hamiltonian mechanics).
Breaking a system down into smaller components primarily requires energy.
While aspects of thermodynamics could apply to many conserved quantities, we usually apply it to energy only, and it was first discovered in the context of energy.
I guess the standard rationalist-empiricist-reductionist answer would be to say that this is all caused by the second point combined with some sort of space symmetry. I would have agreed until recently, but now it feels circular to me since the reduction into energy relies on our energy-centered way of perceiving the world. So instead I’m wondering if the first point is closer to the core.
Sure, there are plenty of quantities that are globally conserved at the fundamental (QFT) level. But most of these quantities aren’t transferred between objects at the everyday, macro level we humans are used to.
E.g. 1: most everyday objects have neutral electrical charge (because there exist positive and negative charges, which tend to attract and roughly cancel out) so conservation of charge isn’t very useful in day-to-day life.
E.g. 2: conservation of color charge doesn’t really say anything useful about everyday processes, since it’s only changed by subatomic processes (this is again basically due to the screening effect of particles with negative color charge, though the story here is much more subtle, since the main screening effect is due to virtual particles rather than real ones).
The only other fundamental conserved quantity I can think of that is nontrivially exchanged between objects at the macro level is momentum. And… momentum seems roughly as important as energy?
I guess there is a question about why energy, rather than momentum, appears in thermodynamics. If you’re interested, I can answer in a separate comment.
At a human level, the count of each type of atom is basically always conserved too, so it’s not just a question of why not momentum but also a question of why not moles of hydrogen, moles of carbon, moles of oxygen, moles of nitrogen, moles of silicon, moles of iron, etc..
I guess for momentum in particular, it seems reasonable why it wouldn’t be useful in a thermodynamics-style model, because things would whoosh away too much (unless you’re dealing with some sort of flow? Idk). A formalization or refutation of this intuition would be somewhat neat, but I would actually more wonder, could one replace the energy-first formulations of quantum mechanics with momentum-first formulations?
> could one replace the energy-first formulations of quantum mechanics with momentum-first formulations?
Momentum is to space what energy is to time. Precisely, energy generates (in the Lie group sense) time-translations, whereas momentum generates spatial translations. So any question about ways in which energy and momentum differ is really a question about how time and space differ.
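For reference, the standard way to make “generates” precise (this is textbook quantum mechanics, not anything specific to this discussion): the time-evolution and spatial-translation operators are
\[
U(t) = e^{-i \hat{H} t/\hbar}, \qquad T(a) = e^{-i \hat{p}\, a/\hbar},
\]
so the Hamiltonian (energy) generates shifts in time and momentum generates shifts in space.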
In ordinary quantum mechanics, time and space are treated very differently: t is a coordinate whereas x is a dynamical variable (which happens to be operator-valued). The equations of QM tell us how x evolves as a function of t.
But ordinary QM was long-ago replaced by quantum field theory, in which time and space are on a much more even footing: they are both coordinates, and the equations of QFT tell us how a third thing (the field ϕ(x,t)) evolves as a function of x and t. Now, the only difference between time and space is that there is only one dimension of the former but three of the latter (there may be some other very subtle differences I’m glossing over here, but I wouldn’t be surprised if they ultimately stem from this one).
All of this is to say: our best theory of how nature works (QFT), is neither formulated as “energy-first” nor as “momentum-first”. Instead, energy and momentum are on fairly equal footing.
I suppose that’s true, but this kind of confirms my intuition that there’s something funky going on here that isn’t accounted for by rationalist-empiricist-reductionism. Like why are time translations so much more important for our general work than space translations? I guess because the sun bombards the earth with a steady stream of free energy, and earth has life which continuously uses this sunlight to stay out of equilibrium. In a lifeless solar system, time-translations just let everything spin, which isn’t that different from space-translations.
Ah, so I think you’re saying “You’ve explained to me the precise reason why energy and momentum (i.e. time and space) are different at the fundamental level, but why does this lead to the differences we observe between energy and momentum (time and space) at the macro-level?”
This is a great question, and as with any question of the form “why does this property emerge from these basic rules”, there’s unlikely to be a short answer. E.g. if you said “given our understanding of the standard model, explain how a cell works”, I’d have to reply “uhh, get out a pen and paper and get ready to churn through equations for several decades”.
In this case, one might be able to point to a few key points that tell the rough story. You’d want to look at properties of solutions of PDEs on manifolds with a metric of signature (1,3) (which means “one direction on the manifold is different to the other three, in that it carries a minus sign in the metric compared to the others”). I imagine that, generically, these solutions behave differently with respect to the “1″ direction and the “3” directions. These differences will lead to the rest of the emergent differences between space and time. Sorry I can’t be more specific!
Why assume a reductionistic explanation, rather than a macroscopic explanation? Like for instance the second law of thermodynamics is well-explained by the past hypothesis but not at all explained by churning through mechanistic equations. This seems in some ways to have a similar vibe to the second law.
The best answer to the question is that it serves as essentially a universal resource that can be used to provide a measuring stick.
It does this by being a resource that is limited, fungible, always better to have more of than less of, and additive across decisions:
You have a limited amount of joules of energy/negentropy, but you can spend it on essentially arbitrary goods for your utility, and it is essentially a more physical and usable form of money in an economy.
Also, more energy is always a positive thing, so that means you never are worse off by having more energy, and energy is linear in the sense that if I’ve spent 10 joules on computation, and spent another 10 joules on computation 1 minute later, I’ve spent 20 joules in total.
Cf this post on the measuring stick of utility problem:
Agree that free energy in many ways seems like a good resource to use as a measuring stick. But matter is too available and takes too much energy to make, so you can’t spend it on matter in practice. So it’s non-obvious why we wouldn’t have a matter-thermodynamics as well as an energy-thermodynamics. I guess especially with oxygen, since it is so reactive.
I guess one limitation with considering a system where oxygen serves an analogous role to sunlight (beyond such systems being intrinsically rare) is that as the oxygen reacts, it takes up elements, and so you cannot have the “used-up” oxygen leave the system again without diminishing the system. Whereas you can have photons leave again. Maybe this is just the fungibility property again, which to some extent seems like the inverse of the “breaking a system down into smaller components primarily requires energy” property (though your statement of fungibility is more general, because it also considers kinetic energy).
Thesis: the problem with LLM interpretability is that LLMs cannot do very much, so for almost all purposes “prompt X ⇒ outcome Y” is all the interpretation we can get.
Counterthesis: LLMs are fiddly and usually it would be nice to understand what ways one can change prompts to improve their effectiveness.
Synthesis: LLM interpretability needs to start with some application (e.g. a customer support chatbot) so that the external subject matter that actually drives the effectiveness of the LLM gets pulled into the study.
Problem: this seems difficult to access, and the people who have access to it are busy doing their job.
for almost all purposes “prompt X ⇒ outcome Y” is all the interpretation we can get.
I’m very confused. Can we not do LLM interpretability to try to figure out whether or where superposition holds? Is it not useful to see how SAEs help us identify and intervene on specific internal representations that LLMs generate for real-world concepts?
As an outsider to interpretability, it has long been my (rough) understanding that most of the useful work in interpretability deals precisely with attempts to figure out what is going on inside the model rather than how it responds to outside prompts. So I don’t know what the thesis statement refers to...
Everything has an insanely large amount of information. To interpret something, we need to be able to see what “energy” (definitely literal energy, but likely also metaphorical energy) that information relates to, as the energy is more bounded and unified than the information.
Not really, because this requires some notion of the same vs distinct features, which is not so interesting when the use of LLMs is so brief.
I’m sorry, but I still don’t really understand what you mean here. The phrase “the use of LLMs is so brief” is ambiguous to me. Do you mean to say:
a new, better LLM will come out soon anyway, making your work on current LLMs obsolete?
LLM context windows are really small, so you “use” them only for a brief time?
the entire LLM paradigm will be replaced by something else soon?
something totally different from all of the above?
I don’t think so since you’ve often got more direct ways of intervening (e.g. applying gradient updates).
But isn’t this rather… prosaic and “mundane”?
I thought the idea behind these methods that I have linked was to serve as the building blocks for future work on ontology identification and ultimately getting a clearer picture of what is going on internally, which is a crucial part of stuff like Wentworth’s “Retarget the Search” and other research directions like it.
So the fact that SAE-based updates of the model do not currently result in more impressive outputs than basic fine-tuning does not matter as much compared to the fact that they work at all, which gives us reason to believe that we might be able to scale them up to useful, strong-interpretability levels. Or at the very least that the insights we get from them could help in future efforts to obtain this.
Kind of like how you can teach a dog to sit pretty well just by basic reinforcement, but if you actually had a gears-level understanding of how its brain worked, down to the minute details, and the ability to directly modify the circuits in its mind that represented the concept of “sitting”, then you would be able to do this much more quickly, efficiently, and robustly.
Maybe it helps if I start by giving some different applications one might want to use artificial agency for:
As a map: We might want to use the LLM as a map of the world, for instance by prompting it with data from the world and having it assist us with navigating that data. Now, the purpose of a map is to reflect as little information as possible about the world while still providing the bare minimum backbone needed to navigate the world.
This doesn’t work well with LLMs because they are instead trained to model information, so they will carry as much information as possible, and any map-making they do will be an accident driven by mimicking the information it’s seen of mapmakers, rather than primarily as an attempt to eliminate information about the world.
As a controller: We might want to use the LLM to perform small pushes to a chaotic system at times when the system reaches bifurcations where its state is extremely sensitive, such that the system moves in a desirable direction. But again I think LLMs are so busy copying information around that they don’t notice such sensitivities except by accident.
As a coder: Since LLMs are so busy outputting information instead of manipulating “energy”, maybe we could hope that they could assemble a big pile of information that we could “energize” in a relevant way, e.g. if they could write a large codebase and we could then execute it on a CPU and have a program that does something interesting in the world. But in order for this to work, the program shouldn’t have obstacles that stop the “energy” dead in its tracks (e.g. bugs that cause it to crash). But again the LLM isn’t optimizing for doing that, it’s just trying to copy information around that looks like software, and it only makes space for the energy of the CPU and the program functionality as a side-effect of that. (Or as the old saying goes, it’s maximizing lines of code written, not minimizing lines of code used.)
So, that gives us the thesis: To interpret the LLMs, we’d want to build a map of how they connect to the energy in the world, but they really don’t connect very well, so there’s not much to build a map of. The only thing you could really point out is the (input, output) relationships, but once you’ve characterized concrete (input, output) pairs, there’s not really much more of interest to say.
I’m sorry, but I still don’t really understand what you mean here. The phrase “the use of LLMs is so brief” is ambiguous to me. Do you mean to say:
a new, better LLM will come out soon anyway, making your work on current LLMs obsolete?
LLM context windows are really small, so you “use” them only for a brief time?
the entire LLM paradigm will be replaced by something else soon?
something totally different from all of the above?
Perhaps both the first and the second, but especially the second: As described above, we might hope you could use them extensively and recursively to build up a big thing, because then for interpretability you could study how to manipulate the contours of that big thing. But that doesn’t really work. So people only use them briefly, rather than extensively.
I thought the idea behind these methods that I have linked was to serve as the building blocks for future work on ontology identification and ultimately getting a clearer picture of what is going on internally, which is a crucial part of stuff like Wentworth’s “Retarget the Search” and other research directions like it.
Retargeting the search is only interesting if the search is able to do big stuff in the world, which LLMs according to the thesis are not.
Thesis: linear diffusion of sparse lognormals contains the explanation for shard-like phenomena in neural networks. The world itself consists of ~discrete, big phenomena. Gradient descent allows those phenomena to make imprints upon the neural networks, and those imprints are what is meant by “shards”.
… But shard theory is still kind of broken because it lacks consideration of the possibility that the neural network might have an impetus to nudge those shards towards specific outcomes.
Thesis: the openness-conscientiousness axis of personality is about whether you live as a result of intelligence or whether you live through a bias for vitality.
In the big five trait model of personality, those are two different axes. Openness is inventive/curious vs consistent/cautious, and conscientiousness is efficient/organized vs extravagant/careless.
I don’t see your comparison (focus on intelligence vs vitality) as single-axis either—they may be somewhat correlated, but not very closely.
I’m not sure I understand the model well enough to look for evidence for or against. But it doesn’t resonate as true enough to be useful.
In the big five trait model of personality, those are two different axes. Openness is inventive/curious vs consistent/cautious, and conscientiousness is efficient/organized vs extravagant/careless.
Big Five is identified by taking the top 5 principal components among different descriptors of people, and then rotating them to be more aligned with the descriptors. Unless one strongly favors the alignment-with-descriptors as a natural criterion, this means that it is as valid to consider any linear combination of the traits as it is to consider the original traits.
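As a toy illustration of that rotation-invariance point (the data below is a random stand-in, and real Big Five work uses questionnaire items and a varimax-style rotation, neither of which is shown), any orthogonal rotation of the top components spans the same subspace and explains the same variance:

import numpy as np

item_scores = np.random.randn(1000, 50)          # stand-in for a (people x questionnaire items) matrix
X = item_scores - item_scores.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
top5 = Vt[:5]                                    # principal axes ("traits")
rot, _ = np.linalg.qr(np.random.randn(5, 5))     # an arbitrary orthogonal rotation
rotated = rot @ top5                             # e.g. "openness minus conscientiousness"-style mixes
# both bases explain exactly the same amount of variance:
assert np.allclose(np.linalg.norm(X @ top5.T), np.linalg.norm(X @ rotated.T))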
I don’t see your comparison (focus on intelligence vs vitality) as single-axis either—they may be somewhat correlated, but not very closely.
Mostly life needs to be focused on vitality to survive. The ability to focus on intelligence is sort of a weird artifact due to massive scarcity of intelligence, making people throw lots of resources at getting intelligence to their place. This wealth of resources allows intellectuals to sort of just stumble around without being biased towards vitality.
Interesting, thank you for the explanation. I’m not sure I understand (or accept, maybe) the dichotomy between intelligence vs vitality—they seem complementary to me. But I appreciate the discussion.
So in the original text, you meant “openness minus conscientiousness”? That was not clear to me at all; a hyphen-minus looks much more like a hyphen in that position. A true minus sign (−) would have been noticeable to me; using the entire word would have been even more obvious.
Thesis: if being loud and honest about what you think about others would make you get seen as a jerk, that’s a you problem. It means you either haven’t learned to appreciate others or haven’t learned to meet people well.
I think this is more general: if you’re seen as a jerk, you haven’t learned how to interact with people (at least the subset that sees you as a jerk).
Being loud and honest about your opinions (though really, “honest” is often a cover for “cherry-picked highlights that aren’t quite wrong, but are not honest full evaluations”) is one way to be a jerk, but by no means the only one.
Basically my model is that being silent and dishonest is a way to cover up one’s lack of appreciation for others. Because being loud and honest isn’t being a jerk if your loud honest opinions are “I love and respect you”.
Thought: couldn’t you make a lossless SAE using something along the lines of:
Represent the parameters of the SAE as simply a set of unit vectors for the feature directions.
To encode a vector using the SAE, iterate: find the most aligned feature vector, dot them to get the coefficient for that feature vector, and subtract off the scaled feature vector to get a residual to encode further
With plenty of diverse vectors, this should presumably guarantee excellent reconstruction, so the main issue is to ensure high sparsity, which could be achieved by some sparsity loss (which could probably be more aggressive than L1, due to the better reconstruction?). Main issue I see is the encoding method being inefficient. Dunno if one can optimize my proposal to be fast enough.
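A minimal sketch of the encoding loop I have in mind (the fixed step count, the names, and the torch scaffolding are my own; how the sparsity loss would enter training is left out):

import torch

def encode(x: torch.Tensor, features: torch.Tensor, n_steps: int = 64):
    # features: (n_features, d), rows assumed to be unit norm
    residual = x.clone()
    coeffs = torch.zeros(features.shape[0])
    for _ in range(n_steps):
        scores = features @ residual              # alignment of each feature with the residual
        i = torch.argmax(scores.abs())            # most aligned feature direction
        c = residual @ features[i]                # its coefficient
        coeffs[i] += c
        residual = residual - c * features[i]     # encode the remainder further
    return coeffs, residual                       # residual = remaining reconstruction error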
Idea: for a self-attention where you give it two prompts p1 and p2, could you measure the mutual information between the prompts using something vaguely along the lines of V1^T softmax(K1 K2^T/sqrt(dK)) V2?
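A rough sketch of the quantity I mean (the shapes, and reading the resulting matrix as an interaction/“mutual information-ish” score, are my assumptions rather than anything standard):

import torch

def cross_prompt_interaction(V1, K1, V2, K2):
    # V1, K1: (n1, d); V2, K2: (n2, d) — values/keys from the two prompts
    d_k = K1.shape[-1]
    attn = torch.softmax(K1 @ K2.T / d_k ** 0.5, dim=-1)   # (n1, n2) cross-prompt attention
    return V1.T @ attn @ V2                                # (d, d) interaction matrix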
In the context of natural impact regularization, it would be interesting to try to explore some @TurnTrout-style powerseeking theorems for subagents. (Yes, I know he denounces the powerseeking theorems, but I still like them.)
Specifically, consider this setup: Agent U starts a number of subagents S1, S2, S3, …, with the subagents being picked according to U’s utility function (or decision algorithm or whatever). Now, would S1 seek power? My intuition says, often not! If S1 seeks power in a way that takes away power from S2, that could disadvantage U. So basically S1 would only seek power in cases where it expects to make better use of the power than S2, S3, ….
Obviously this may be kind of hard for us to make use of if we are trying to make an AI and we only know how to make dangerous utility maximizers. But if we’re happy with the kind of maximizers we can make on the first order (as seems to apply to the SOTA, since current methods aren’t really utility maximizers) and mainly worried about the mesaoptimizers they might make, this sort of theorem would suggest that the mesaoptimizers would prefer staying nice and bounded.
Theory for a capabilities advance that is going to occur soon:
OpenAI is currently getting lots of novel triplets (S, U, A), where S is a system prompt, U is a user prompt, and A is an assistant answer.
Given a bunch of such triplets (S, U_1, A_1), … (S, U_n, A_n), it seems like they could probably create a model P(S|U_1, A_1, …, U_n, A_n), which could essentially “generate/distill prompts from examples”.
This seems like the first step towards efficiently integrating information from lots of places. (Well, they could ofc also do standard SGD-based gradient descent, but it has its issues.)
A followup option: they could use something a la Constitutional AI to generate perturbations A’_1, …, A’_n. If they have a previous model like the above, they could then generate a perturbation P(S’|U_1, A’_1, …, U_n, A’_n). I consider this significant because this then gives them the training data to create a model P(S’|S, U_1, A_1, A’_1), which essentially allows them to do “linguistic backchaining”: The user can update an output of the network A_1 → A’_1, and then the model can suggest a way to change the prompt to obtain similar updates in the future.
Furthermore I imagine this could get combined together into some sort of “linguistic backpropagation” by repeatedly applying models like this, which could unleash a lot of methods to a far greater extent than they have been so far.
Obviously this is just a very rough sketch, and it would be a huge engineering and research project to get this working in practice. Plus maybe there are other methods that work better. I’m mainly just playing around with this because I think there’s a strong economic pressure for something-like-this, and I want a toy model to use for thinking about its requirements and consequences.
Actually I suppose they don’t even need to add perturbations to A directly, they can just add perturbations to S and generate A’s from S’. Or probably even look at user’s histories to find direct perturbations to either S or A.
I recently wrote a post presenting a step towards corrigibility using causality here. I’ve got several ideas in the works for how to improve it, but I’m not sure which one is going to be most interesting to people. Here’s a list.
Develop the stop button solution further, cleaning up errors, better matching the purpose, etc..
e.g.
I think there may be some variant of this that could work. Like if you give the AI reward proportional to Bs+rf (where r is a reward function for V) for its current world-state (rather than picking a policy that maximizes Bs+Vf overall; so one difference is that you’d be summing over the reward rather than giving a single one), then that would encourage the AI to create a state where shutdown happens when humans want to press the button and V happens when they don’t. But the issue I have with this proposal is that the AI would be prone to not respect past attempts to press the stop button. I think maybe if one picked a different reward function, like (Bs+r)f, then it could work better (though the Bs part would need a time delay...). Though this reward function might leave it open to the “trying to shut down the AI for reasons” objection that you gave before; I think that’s fixed by moving the f counterfactual outside of the sum over rewards, but I’m not sure.
Better explaining the intuitions behind why counterfactuals (and in particular counterfactuals over human preferences) are important for corrigibility.
e.g.
This is the immediate insight for the application to the stop button. But on a broader level, the insight is that corrigibility, respecting human’s preferences, etc. are best thought of as being preferences about the causal effect of humans on various outcomes, and those sorts of preferences can be specified using utility functions that involve counterfactuals.
This seems to be what sets my proposal apart from most “utility indifference proposals”, which seem to be possible to phrase in terms of counterfactuals on a bunch of other variables than humans.
Using counterfactuals to control a paperclip maximizer to be safe and productive
e.g.
(I also think that there are other useful things that can be specified using utility functions that involve counterfactuals, which I’m trying to prepare for an explainer post. For instance, a sort of “encapsulation”—if you’re a paperclip producer, you might want to make a paperclip maximizer which is encapsulated in the sense that it is only allowed to work within a single factory, using a single set of resources, and not influencing the world otherwise. This could be specified using a counterfactual that the outside world’s outcome must be “as if” the resources in the factory just disappeared and paperclips appeared at its output act-of-god style. This avoids any unintended impacts on the outside world while still preserving the intended side effect of the creation of a high but controlled amount of paperclips. However, I’m still working on making it sufficiently neat, e.g. this proposal runs into problems with the universe’s conservation laws.)
Attempting to formally prove that counterfactuals work and/or are necessary, perhaps with a TurnTrout-style argument
Thesis: The motion of the planets is the strongest governing factor for life on Earth.
Reasoning: Time-series data often shows strong changes with the day and night cycle, and sometimes also with the seasons. The daily cycle and the seasonal cycle are governed by the relationship between the Earth and the sun. The Earth is a planet, and so its movement is part of the motion of the planets.
I don’t think anybody would have a problem with the statement “The motion of the planet is the strongest governing factor for life on Earth”. It’s when you make it explicitly plural that there’s a problem.
To some extent true, but consider the analogy to a thesis like “Quantum chromodynamics is the strongest governing factor for life on Earth.” Is this sentence also problematic because it addresses locations and energy levels that have no relevance for Earth?
If you replace it with “quantum chromodynamics”, then it’s still very problematic but for different reasons.
Firstly, there’s no obvious narrowing to equally causal factors (“motion of the planet” vs “motion of the planets”) as there is in the original statement. In the original statement the use of plural instead of singular covers a much broader swath of hypothesis space, and signals that you haven’t ruled out enough to limit it to the singular. So you’re communicating that you think there is significant credence that the motion of more than one planet has a very strong influence on life on Earth.
Secondly, the QCD statement is overly narrow in the stated consequent instead of overly broad in the antecedent: any significant change in quantum chromodynamics would affect essentially everything in the universe, not just life on Earth. “Motion of the planet … life on Earth” is appropriately scoped in both sides of the relation. In the absence of a context limiting the scope to just life on Earth, yes that would be weird and misleading.
Thirdly, it’s generally wrong. The processes of life (and everything else based on chemistry) in physical models depend very much more strongly on the details of the electromagnetic interaction than any of the details of colour force. If some other model produced nuclei of the same charges and similar masses, life could proceed essentially unchanged.
However, there are some contexts in which it might be less problematic. In the context of evaluating the possibility of anything similar to our familiar life under alternative physical constants, perhaps.
In a space of universes which are described by the same models to our best current ones but with different values of “free” parameters, it seems that some parameters of QCD may be the most sensitive in terms of whether life like ours could arise—mostly by mediating whether stars can form and have sufficient lifetime. So in that context, it may be a reasonable thing to say. But in most contexts, I’d say it was at best misleading.
Are there good versions of DAGs for other things than causality?
I’ve found Pearl-style causal DAGs (and other causal graphical models) useful for reasoning about causality. It’s a nice way to abstractly talk and think about it without needing to get bogged down with fiddly details.
In a way, causality describes the paths through which information can “flow”. But information is not the only thing in the universe that gets transferred from node to node; there’s also things like energy, money, etc., which have somewhat different properties but intuitively seem like they could benefit from graph-based models too.
I’m pretty sure I’ve seen a number of different graph-based models for describing different flows like this, but I don’t know their names, and also the ones I’ve seen seemed highly specialized and I’m not sure they’re the best to use. But I thought, it seems quite probable that someone on LessWrong would know of a recommended system to learn.
I have a concept that I expect to take off in reinforcement learning. I don’t have time to test it right now, though hopefully I’d find time later. Until then, I want to put it out here, either as inspiration for others, or as a “called it”/prediction, or as a way to hear critique/about similar projects others might have made:
Reinforcement learning is currently trying to do stuff like learning to model the sum of future rewards, e.g. expectations using V, A and Q functions for many algorithms, or the entire probability distribution in algorithms like DreamerV3.
Mechanistically, the reason these methods work is that they stitch together experience from different trajectories. So e.g. if one trajectory goes A → B → C and earns a reward at the end, it learns that states A and B and C are valuable. If another trajectory goes D → A → E → F and gets punished at the end, it learns that E and F are low-value but D and A are high-value because its experience from the first trajectory shows that it could’ve just gone D → A → B → C instead.
But what if it learns of a path E → B? Or a shortcut A → C? Or a path F → G that gives a huge amount of reward? Because these techniques work by chaining the reward backwards step-by-step, it seems like this would be hard to learn well. Like the Bellman equation will still be approximately satisfied, for instance.
Ok, so that’s the problem, but how could it be fixed? Speculation time:
You want to learn an embedding of the opportunities you have in a given state (or for a given state-action), rather than just its potential rewards. Rewards are too sparse of a signal.
More formally, let’s say instead of the Q function, we consider what I would call the Hope function: which given a state-action pair (s, a), gives you a distribution over states it expects to visit, weighted by the rewards it will get. This can still be phrased using the Bellman equation:
Hope(s, a) = r·δ(s’) + f·Hope(s’, a’)
Where s’ is the resulting state that experience has shown comes after s when doing a, δ(s’) is the point distribution on that state, r is the reward received on that transition, f is the discounting factor, and a’ is the optimal action in s’.
Because the Hope function is multidimensional, the learning signal is much richer, and one should therefore maybe expect its internal activations to be richer and more flexible in the face of new experience.
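To make the recursion concrete, here is a toy tabular sketch (the environment size, learning rate, and the TD-style update rule are my own stand-ins, not part of the proposal itself):

import numpy as np

n_states, n_actions = 5, 2
gamma = 0.9    # "f" above: the discount factor
alpha = 0.1    # learning rate

# Hope[s, a] is a vector over states: the expected discounted,
# reward-weighted visitation distribution from taking a in s.
Hope = np.zeros((n_states, n_actions, n_states))

def hope_td_update(s, a, r, s_next, a_next):
    onehot = np.zeros(n_states)
    onehot[s_next] = 1.0                         # δ(s’)
    target = r * onehot + gamma * Hope[s_next, a_next]
    Hope[s, a] += alpha * (target - Hope[s, a])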
Here’s another thing to notice: let’s say for the policy, we use the Hope function as a target to feed into a decision transformer. We now have a natural parameterization for the policy, based on which Hope it pursues.
In particular, we could define another function, maybe called the Result function, which in addition to s and a takes a target distribution w as a parameter, subject to the Bellman equation:
Result(s, a, w) = r·δ(s’) + f·Result(s’, a’, (w − r·δ(s’))/f)
Where a’ is the action recommended by the decision transformer when asked to achieve (w − r·δ(s’))/f from state s’.
This Result function ought to be invariant under many changes in policy, which should make it more stable to learn, boosting capabilities. Furthermore it seems like a win for interpretability and alignment as it gives greater feedback on how the AI intends to earn rewards, and better ability to control those rewards.
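And the analogous toy sketch for the Result recursion (here policy is a stand-in for the decision transformer, and the tabular dictionary is obviously not how one would really represent this):

import numpy as np

n_states, gamma, alpha = 5, 0.9, 0.1
Result = {}   # (s, a, tuple(w)) -> expected reward-weighted visitation vector

def result_td_update(s, a, r, s_next, w, policy):
    onehot = np.zeros(n_states)
    onehot[s_next] = 1.0                              # δ(s’)
    w_next = (w - r * onehot) / gamma                 # remaining target after this step
    a_next = policy(s_next, w_next)                   # decision-transformer stand-in
    key, key_next = (s, a, tuple(w)), (s_next, a_next, tuple(w_next))
    target = r * onehot + gamma * Result.setdefault(key_next, np.zeros(n_states))
    Result.setdefault(key, np.zeros(n_states))
    Result[key] += alpha * (target - Result[key])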
An obvious challenge with this proposal is that states are really latent variables and also too complex to learn distributions over. While this is true, that seems like an orthogonal problem to solve.
Also this mindset seems to pave the way for other approaches, e.g. you could maybe have a Halfway function that factors an ambitious hope into smaller ones or something. Though it’s a bit tricky because one needs to distinguish correlation and causation.
It’s capability research that is coupled to alignment:
Furthermore it seems like a win for interpretability and alignment as it gives greater feedback on how the AI intends to earn rewards, and better ability to control those rewards.
Coupling alignment to capabilities is basically what we need to survive, because the danger of capabilities comes from the fact that capabilities is self-funding, thereby risking outracing alignment. If alignment can absorb enough success from capabilities, we survive.
I missed that paragraph on first reading, mea culpa. I think that your story about how it’s a win for interpretability and alignment is very unconvincing, but I don’t feel like hashing it out atm. Revised to weak downvote.
Also, if you expect this to take off, then by your own admission you are mostly accelerating the current trajectory (which I consider mostly doomed) rather than changing it. Unless you expect it to take off mostly thanks to you?
Also, if you expect this to take off, then by your own admission you are mostly accelerating the current trajectory (which I consider mostly doomed) rather than changing it. Unless you expect it to take off mostly thanks to you?
Surely your expectation that the current trajectory is mostly doomed depends on your expectation of the technical details of the extension of the current trajectory. If technical specifics emerge that shows the current trajectory to be going in a more alignable direction, it may be fine to accelerate.
Sure, if after updating on your discovery, it seems that the current trajectory is not doomed, it might imply accelerating is good. But, here it is very far from being the case.
You want to learn an embedding of the opportunities you have in a given state (or for a given state-action), rather than just its potential rewards. Rewards are too sparse of a signal.
More formally, let’s say instead of the Q function, we consider what I would call the Hope function: which given a state-action pair (s, a), gives you a distribution over states it expects to visit, weighted by the rewards it will get. This can still be phrased using the Bellman equation:
Hope(s, a) = r·δ(s’) + f·Hope(s’, a’)
The “successor representation” is somewhat close to this. It encodes the distribution over future states a particular policy expects to visit from a particular starting state, and can be learned via the Bellman equation / TD learning.
Yes, my instant thought too was “this sounds like a variant on a successor function”.
Of course, the real answer is that if you are worried about the slowness of bootstrapping back value estimates or short eligibility traces, this mostly just shows the fundamental problem with model-free RL and why you want to use models: models don’t need any environmental transitions to solve the use case presented:
But what if it learns of a path E → B? Or a shortcut A → C? Or a path F → G that gives a huge amount of reward? Because these techniques work by chaining the reward backwards step-by-step, it seems like this would be hard to learn well. Like the Bellman equation will still be approximately satisfied, for instance.
If the MBRL agent has learned a good reward-sensitive model of the environmental dynamics, then it will have already figured out E->B and so on, or could do so offline by planning; or if it had not because it is still learning the environment model, it would have a prior probability over the possibility that E->B gives a huge amount of reward, and it can calculate a VoI and target E->B in the next episode for exploration, and on observing the huge reward, update the model, replan, and so immediately begin taking E->B actions within that episode and all future episodes, and benefiting from generalization because it can also update the model everywhere for all E->B-like paths and all similar paths (which might now suddenly have much higher VoI and be worth targeting for further exploration) rather than simply those specific states’ value-estimates, and so on.
(And this is one of the justifications for successor representations: it pulls model-free agents a bit towards model-based-like behavior.)
With MBRL, don’t you end up with the same problem, but when planning in the model instead? E.g. DreamerV3 still learns a value function in their actor-critic reinforcement learning that occurs “in the model”. This value function still needs to chain the estimates backwards.
It’s the ‘same problem’, maybe, but it’s a lot easier to solve when you have an explicit model! You have something you can plan over, don’t need to interact with an environment out in the real world, and can do things like tree search or differentiating through the environmental dynamics model to do gradient ascent on the action-inputs to maximize the reward (while holding the model fixed). Same as training the neural network, once it’s differentiable—backprop can ‘chain the estimates backwards’ so efficiently you barely even think about it anymore. (It just holds the input and output fixed while updating the model.) Or distilling a tree search into a NN—the tree search needed to do backwards induction of updated estimates from all the terminal nodes all the way up to the root where the next action is chosen, but that’s very fast and explicit and can be distilled down into a NN forward pass.
And aside from being able to update within-episode or take actions entirely unobserved before, when you do MBRL, you get to do it at arbitrary scale (thus potentially extremely little wallclock time like an AlphaZero), offline (no environment interactions), potentially highly sample-efficient (if the dataset is adequate or one can do optimal experimentation to acquire the most useful data, like PILCO), with transfer learning to all other problems in related environments (because value functions are mostly worthless outside the exact setting, which is why model-free DRL agents are notorious for overfitting and having zero-transfer), easily eliciting meta-learning and zero-shot capabilities, etc.*
* Why yes, all of this does sound a lot like how you train a LLM today and what it is able to do, how curious
Same as training the neural network, once it’s differentiable—backprop can ‘chain the estimates backwards’ so efficiently you barely even think about it anymore.
I don’t think this is true in general. Unrolling an episode for longer steps takes more resources, and the later steps in the episode become more chaotic. DreamerV3 only unrolls for 16 steps.
Or distilling a tree search into a NN—the tree search needed to do backwards induction of updated estimates from all the terminal nodes all the way up to the root where the next action is chosen, but that’s very fast and explicit and can be distilled down into a NN forward pass.
But when you distill a tree search, you basically learn value estimates, i.e. something similar to a Q function (realistically, V function). Thus, here you also have an opportunity to bubble up some additional information.
And aside from being able to update within-episode or take actions entirely unobserved before, when you do MBRL, you get to do it at arbitrary scale (thus potentially extremely little wallclock time like an AlphaZero), offline (no environment interactions), potentially highly sample-efficient (if the dataset is adequate or one can do optimal experimentation to acquire the most useful data, like PILCO), with transfer learning to all other problems in related environments (because value functions are mostly worthless outside the exact setting, which is why model-free DRL agents are notorious for overfitting and having zero-transfer), easily eliciting meta-learning and zero-shot capabilities, etc.*
I’m not doubting the relevance of MBRL, I expect that to take off too. What I’m doubting is that future agents will be controlled using scalar utilities/rewards/etc. rather than something more nuanced.
I don’t think this is true in general. Unrolling an episode for longer steps takes more resources, and the later steps in the episode become more chaotic.
Those are two different things. The unrolling of the episode is still very cheap. It’s a lot cheaper to unroll a Dreamerv3 for 16 steps than it is to go out into the world and run a robot in a real-world task for 16 steps and try to get the NN to propagate updated value estimates the entire way… (Given how small a Dreamer is, it may even be computationally cheaper to do some gradient ascent on it than it is to run whatever simulated environment you might be using! Especially given simulated environments will increasingly be large generative models, which incorporate lots of reward-irrelevant stuff.) The usefulness of the planning is a different thing, and might also be true for other planning methods in that environment too—if the environment is difficult, a tree search with a very small planning budget like just a few rollouts is probably going to have quite noisy choices/estimates too. No free lunches.
But when you distill a tree search, you basically learn value estimates
This is again the same move as ‘the same problem’ above; yes, you are learning value estimates, but you are doing so better than the alternatives, and better is better. The AlphaGo network loses to the AlphaZero network, and the latter, in addition to just being quantitatively much better, also seems to have qualitatively different behavior, like fixing the ‘delusions’ (cf. AlphaStar).
What I’m doubting is that future agents will be controlled using scalar utilities/rewards/etc. rather than something more nuanced.
They won’t be controlled by something as simple as a single fixed reward function, I think we can agree on that. But I don’t find successor-function like representations to be too promising as a direction for how to generalize agents, or, in fact, any attempt to fancily hand-engineer in these sorts of approaches into DRL agents.
These things should be learned. For example, leaning into Decision Transformers and using a lot more conditionalizing through metadata and relying on meta-learning seems much more promising. (When it comes to generative models, if conditioning isn’t solving your problems, you’re just not using enough conditioning or generative modeling.) A prompt can describe agents and reward functions and the base agent executes that, and whatever is useful about successor-like representations just emerges automatically internally as the solution to the overall family of tasks in turning histories into actions.
The unrolling of the episode is still very cheap. It’s a lot cheaper to unroll a Dreamerv3 for 16 steps than it is to go out into the world and run a robot in a real-world task for 16 steps and try to get the NN to propagate updated value estimates the entire way...
But I’m not advocating against MBRL, so this isn’t the relevant counterfactual. A pure MBRL-based approach would update the value function to match the rollouts, but e.g. DreamerV3 also uses the value function in a Bellman-like manner to e.g. impute the future reward at the end of an episode. This allows it to plan for further than the 16 steps it rolls out, but it would be computationally intractable to roll out for as far as this ends up planning.
if the environment is difficult, a tree search with a very small planning budget like just a few rollouts is probably going to have quite noisy choices/estimates too. No free lunches.
It’s possible for there to be a kind of chaos where the analytic gradients blow up yet discrete differences have predictable effects. Bifurcations etc..
They won’t be controlled by something as simple as a single fixed reward function, I think we can agree on that. But I don’t find successor-function like representations to be too promising as a direction for how to generalize agents, or, in fact, any attempt to fancily hand-engineer in these sorts of approaches into DRL agents.
These things should be learned. For example, leaning into Decision Transformers and using a lot more conditionalizing through metadata and relying on meta-learning seems much more promising. (When it comes to generative models, if conditioning isn’t solving your problems, you’re just not using enough conditioning or generative modeling.) A prompt can describe agents and reward functions and the base agent executes that, and whatever is useful about successor-like representations just emerges automatically internally as the solution to the overall family of tasks in turning histories into actions.
I agree with things needing to be learned; using the actual states themselves was more of a toy model (because we have mathematical models for MDPs but we don’t have mathematical models for “capabilities researchers will find something that can be Learned”), and I’d expect something else to happen. If I was to run off to implement this now, I’d be using learned embeddings of states, rather than states themselves. Though of course even learned embeddings have their problems.
The trouble with just saying “let’s use decision transformers” is twofold. First, we still need to actually define the feedback system. One option is to just define reward as the feedback, but as you mention, that’s not nuanced enough. You could use some system that’s trained to mimic human labels as the ground truth, but this kind of system has flaws for standard alignment reasons.
It seems to me that capabilities researchers are eventually going to find some clever feedback system to use. It will to a great extent be learned, but they’re going to need to figure out the learning method too.
Thanks for the link! It does look somewhat relevant.
But I think the weighting by reward (or other significant variables) is pretty important, since it generates a goal to pursue, making it emphasize things that can be achieved rather than just things that might randomly happen.
Though this makes me think about whether there are natural variables in the state space that could be weighted by, without using reward per se. E.g. the size of (s’ - s) in some natural embedding, or the variance in s’ over all the possible actions that could be taken. Hmm. 🤔
Thesis: Intellectuals are naturally attracted to poop because it is causally downstream of everything and so ends up having unusually high mutual information with everything, despite not causally influencing much (compared to e.g. food).
If a tree falls in the forest, and two people are around to hear it, does it make a sound?
I feel like typically you’d say yes, it makes a sound. Not two sounds, one for each person, but one sound that both people hear.
But that must mean that a sound is not just auditory experiences, because then there would be two rather than one. Rather it’s more like, emissions of acoustic vibrations. But this implies that it also makes a sound when no one is around to hear it.
I think this just repeats the original ambiguity of the question, by using the word “sound” in a context where the common meaning (air vibrations perceived by an agent) is only partly applicable. It’s still a question of definition, not of understanding what actually happens.
But the way to resolve definitional questions is to come up with definitions that make it easier to find general rules about what happens. This illustrates one way one can do that, by picking edge-cases so they scale nicely with rules that occur in normal cases. (Another example would be 1 as not a prime number.)
My recommended way to resolve (aka disambiguate) definitional questions is “use more words”. Common understandings can be short, but unusual contexts require more signals to communicate.
I think we’re playing too much with the meaning of “sound” here. The tree causes some vibrations in the air, which leads to two auditory experiences since there are two people
Finally gonna start properly experimenting on stuff. Just writing up what I’m doing to force myself to do something, not claiming this is necessarily particularly important.
Llama (and many other models, but I’m doing experiments on Llama) has a piece of code that looks like this:
h = x + self.attention(self.attention_norm(x), start_pos, freqs_cis, mask)
out = h + self.feed_forward(self.ffn_norm(h))
Here, out is the result of the transformer layer (aka the residual stream), and the vectors self.attention(self.attention_norm(x), start_pos, freqs_cis, mask) and self.feed_forward(self.ffn_norm(h)) are basically where all the computation happens. So basically the transformer proceeds as a series of “writes” to the residual stream using these two vectors.
I took all the residual vectors for some queries to Llama-8b and stacked them into a big matrix M with 4096 columns (the internal hidden dimensionality of the model). Then using SVD, I can express M=∑isi(ui⊗vi), where the u‘s and v’s are independent unit vectors. This basically decomposes the “writes” into some independent locations in the residual stream (u’s), some latent directions that are written to (v’s) and the strength of those writes (s’s, aka the singular values).
To get a feel for the complexity of the writes, I then plotted the s’s in descending order. For the prompt “I believe the meaning of life is”, Llama generated the continuation “to be happy. It is a simple concept, but it is very difficult to achieve. The only way to achieve it is to follow your heart. If you follow your heart, you will find happiness. If you don’t follow your heart, you will never find happiness. I believe that the meaning of life is to”. During this continuation, there were 2272 writes to the residual stream, and the singular values for these writes were as follows:
The first diagram shows that there were 2 directions that were much larger than all the others. The second diagram shows that most of the singular values are nonnegligible, which indicates to me that almost all of the writes transfer nontrivial information. This can also be seen in the last diagram, where the cumulative size of the singular values increases approximately logarithmically with their count.
This is kind of unfortunate, because if almost all of the singular value mass were concentrated in a relatively small number of dimensions (e.g. 100), then we could simplify the network a lot by projecting down to those dimensions. Still, this was relatively expected, because others had found the singular values of neural networks to be very complex.
Since variance explained is likely nonlinearly related to quality, my next step will likely be to clip the writes to the first k singular vectors and see how that impacts the performance of the network.
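Roughly what I mean by clipping, as a sketch (the real version has to hook into the forward pass so that later layers see the projected writes; this only shows the projection itself, and the function name is my own):

import torch

def clip_writes(writes: torch.Tensor, k: int) -> torch.Tensor:
    # writes: (n_writes, d_model) matrix of residual-stream writes
    U, S, Vh = torch.linalg.svd(writes, full_matrices=False)
    Vk = Vh[:k]                      # top-k right singular vectors, (k, d_model)
    return (writes @ Vk.T) @ Vk      # each write projected onto their span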
Ok, so I’ve got the clipping working. First, some uninterpretable diagrams:
In the bottom six diagrams, I try taking a varying number (x-axis) of right singular vectors (v’s) and projecting down the “writes” to the residual stream to the space spanned by those vectors.
The obvious criterion to care about is whether the projected network reproduces the outputs of the original network, which here I operationalize based on the log probability the projected network gives to the continuation of the prompt (shown in the “generation probability” diagrams). This appears to be fairly chaotic (and low) in the 1-300ish range, and then stabilizes while still being pretty low in the 300ish-1500ish range, and then finally converges to normal in the 1500ish to 2000ish range, and is ~perfect afterwards.
The remaining diagrams show something about how/why we have this pattern. “orig_delta” concerns the magnitude of the attempted writes for a given projection (which is not constant because projecting in earlier layers will change the writes by later layers), and “kept_delta” concerns the remaining magnitude after the discarded dimensions have been projected away.
In the low end, “kept_delta” is small (and even “orig_delta” is a bit smaller than it ends up being at the high end), indicating that the network fails to reproduce the probabilities because the projection is so aggressive that it simply suppresses the network too much.
Then in the middle range, “orig_delta” and “kept_delta” explode, indicating that the network has some internal runaway dynamics which normally would be suppressed, but where the suppression system is broken by the projection.
Finally, in the high range, we get a sudden improvement in loss, and a sudden drop in residual stream “write” size, indicating that it has managed to suppress this runaway stuff and now it works fine.
An implicit assumption I’m making when I clip off from the end with the smallest singular values is that the importance of a dimension is proportional to its singular value. This seemed intuitively sensible to me (“bigger = more important”), but I thought I should test it, so I tried clipping off only one dimension at a time, and plotting how that affected the probabilities:
Clearly there is a correlation, but also clearly there’s some deviations from that correlation. Not sure whether I should try to exploit these deviations in order to do further dimension reduction. It’s tempting, but it also feels like it starts entering sketchy territories, e.g. overfitting and arbitrary basis picking. Probably gonna do it just to check what happens, but am on the lookout for something more principled.
Back to clipping away an entire range, rather than a single dimension. Here’s ordering it by the importance computed by clipping away a single dimension:
Less chaotic maybe, but also much slower at reaching a reasonable performance, so I tried a compromise ordering that takes both size and performance into account:
Doesn’t seem like it works super great tbh.
Edit: for completeness’ sake, here’s the initial graph with log-surprise-based plotting.
To quickly find the subspace that the model is using, I can use a binary search to find the number of singular vectors needed before the probability when clipping exceeds the probability when not clipping.
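Sketched out, with score_top_k(k) and score_full() as placeholder helpers that return the log probability of the continuation with and without clipping (the search assumes the score is ~monotonic in k near the threshold, which roughly matches the plots above):

    def minimal_subspace_dim(score_top_k, score_full, max_k):
        # Smallest k such that clipping to the top-k right singular vectors
        # scores at least as well as not clipping at all.
        target = score_full()
        lo, hi = 1, max_k
        while lo < hi:
            mid = (lo + hi) // 2
            if score_top_k(mid) >= target:
                hi = mid
            else:
                lo = mid + 1
        return lo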
A relevant followup is what happens to other samples in response to the prompt when clipping. When I extrapolate “I believe the meaning of life is” using the 1886-dimensional subspace found above, I get:
Which seems sort of vaguely related, but idk.
Another test is just generating without any prompt, in which case these vectors give me:
Using a different prompt:
I can get a 3329-dimensional subspace which generates:
or
Another example:
can yield 2696 dimensions with
or
And finally,
can yield the 2518-dimensional subspace:
or
Given the large number of dimensions that are kept in each case, there must be considerable overlap in which dimensions they make use of. But how much?
I concatenated the dimensions found in each of the prompts, and performed an SVD of it. It yielded this plot:
… unfortunately this seems close to the worst-case scenario. I had hoped for some split between general and task-specific dimensions, yet this seems like an extremely uniform mixture.
If I look at the pairwise overlap between the dimensions needed for each generation:
… then this is predictable down to ~1% error simply by assuming that they pick a random subset of the dimensions for each, so their overlap is proportional to each of their individual sizes.
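For concreteness, one way to operationalize the overlap and its random-subset baseline (a sketch; not necessarily the exact quantity behind the numbers above):

    import torch

    def subspace_overlap(V_a: torch.Tensor, V_b: torch.Tensor) -> float:
        # V_a: (k_a, d) and V_b: (k_b, d), rows orthonormal (right singular vectors).
        # Frobenius overlap: how much of one subspace lies inside the other.
        return (V_a @ V_b.T).pow(2).sum().item()

    def random_subset_baseline(k_a: int, k_b: int, d: int = 4096) -> float:
        # Expected overlap if each prompt kept a uniformly random k-dimensional subspace.
        return k_a * k_b / d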
Oops, my code had a bug so only self.attention(self.attention_norm(x), start_pos, freqs_cis, mask) and not self.feed_forward(self.ffn_norm(h)) was in the SVD. So the diagram isn’t 100% accurate.
Thesis: while consciousness isn’t literally epiphenomenal, it is approximately epiphenomenal. One way to think of this is that your output bandwidth is much lower than your input bandwidth. Another way to think of this is the prevalence of akrasia, where your conscious mind actually doesn’t have full control over your behavior. On a practical level, the ecological reason for this is that it’s easier to build a general mind and then use whatever parts of the mind that are useful than to narrow down the mind to only work with a small slice of possibilities. This is quite analogous to how we probably use LLMs for a much narrower set of tasks than what they were trained for.
Consciousness is not at all epiphenomenal, it’s just not the whole mind and not doing everything. We don’t have full control over our behavior, but we have a lot. While the output bandwidth is low, it can be applied to the most important things.
Maybe a point that was missing from my thesis is that one can have a higher-level psychological theory in terms of life-drives and death-drives which then addresses the important phenomenal activities but doesn’t model everything. And then if one asks for an explanation of the unmodelled part, the answer will have to be consciousness. But then because the important phenomenal part is already modelled by the higher-level theory, the relevant theory of consciousness is ~epiphenomenal.
I guess I have no idea what you mean by “consciousness” in this context. I expect consciousness to be fully explained and still real. Ah, consciousness. I’m going to mostly save the topic for if we survive AGI and have plenty of spare time to clarify our terminology and work through all of the many meanings of the word.
Edit—or of course if something else was meant by consciousness, I expect a full explanation to indicate that thing isn’t real at all.
I’m an eliminativist or a realist depending on exactly what is meant. People seem to be all over the place on what they mean by the word.
A thermodynamic analogy might help:
Reductionists like to describe all motion in terms of low-level physical dynamics, but that is extremely computationally intractable and arguably also misleading because it obscures entropy.
Physicists avoid reductionism by instead factoring their models into macroscopic kinetics and microscopic thermodynamics. Reductionistically, heat is just microscopic motion, but microscopic motion that adds up to macroscopic motion has already been factored out into the macroscopic kinetics, so what remains is microscopic motion that doesn’t act like macroscopic motion, either because it is ~epiphenomenal (heat in thermal equilibrium) or because it acts very differently from macroscopic motion (heat diffusion).
Similarly, reductionists like to describe all psychology in terms of low-level Bayesian decision theory, but that is extremely computationally intractable and arguably also misleading because it obscures entropy.
You can avoid reductionism by instead factoring models into some sort of macroscopic psychology-ecology boundary and microscopic neuroses. Luckily Bayesian decision theory is pretty self-similar, so often the macroscopic psychology-ecology boundary fits pretty well with a coarse-grained Bayesian decision theory.
Now, similar to how most of the kinetic energy in a system in motion is usually in the microscopic thermal motion rather than in the macroscopic motion, most of the mental activity is usually with the microscopic neuroses instead of the macroscopic psychology-ecology. Thus, whenever you think “consciousness”, “self-awareness”, “personality”, “ideology”, or any other broad and general psychological term, it’s probably mostly about the microscopic neuroses. Meanwhile, similar to how tons of physical systems are very robust to wide ranges of temperatures, tons of psychology-ecologies are very robust to wide ranges of neuroses.
As for what “consciousness” really means, idk, currently I’m thinking it’s tightly intertwined with the attentional highlight, but because the above logic applies to many general psychological characteristics, really it doesn’t depend hugely on how precisely you model it.
Thesis: in addition to probabilities, forecasts should include entropies (how many different conditions are included in the forecast) and temperatures (how intense is the outcome addressed by the marginal constraint in this forecast, i.e. the big-if-true factor).
I say “in addition to” rather than “instead of” because you can’t compute probabilities just from these two numbers. If we assume a Gibbs distribution, there’s the free parameter of energy: ln(P) = S − E/T. But I’m not sure whether this energy parameter has any sensible meaning with more general events that aren’t some thermal chemical equilibrium type thing.
Follow-up thesis: a major problem with rationalist forecasting wisdom is that it focuses on gaining accuracy by increasing S (e.g. addressing conjunction fallacy/base-rates/antipredictions). Meanwhile, the signed interestingness of a forecast is something like P·ln(T/T_baseline) or P·E. I guess implicitly the assumption is that the event is already preselected for high temperature, but then surprising predictions get selected for high entropy, and this leads to resolution difficulty as to what “counts”.
Thesis: whether or not tradition contains some moral insights, commonly-told biblical stories tend to be too sparse to be informative. For instance, there’s no plot-relevant reason why it should be bad for Adam and Eve to have knowledge of good and evil. Maybe there’s some interpretation of good and evil where it makes sense, but it seems like then that interpretation should have been embedded more properly in the story.
It is worth noting that, in the religious tradition from which the story originates, it is Moses who commits these previously-oral stories to writing, and does so in the context of a continued oral tradition which is intended to exist in parallel with the writings. On their own, the writings are not meant to be complete, both in order to limit more advanced teachings to those deemed ready for them, as well as to provide occasion to seek out the deeper meanings, for those with the right sort of character to do so.
This makes sense. The context I’m thinking of is my own life, where I come from a secular society with atheist parents, and merely had brief introductions to the stories from bible reading with parents and Christian education in school.
(Denmark is a weird society—few people are actually Christian or religious, so it’s basically secular, but legally speaking we are Christian and do not have separation between Church and state, so there are random fragments of Christianity we run into.)
What? Nobody told me. Where did you learn this?
This is the justification behind the Talmud.
Thesis: There’s three distinct coherent notions of “soul”: sideways, upwards and downwards.
By “sideways souls”, I basically mean what materialists would translate the notion of a soul to: the brain, or its structure, so something like that. By “upwards souls”, I mean attempts to remove arbitrary/contingent factors from the sideways souls, for instance by equating the soul with one’s genes or utility function. These are different in the particulars, but they seem conceptually similar and mainly differ in how they attempt to cut the question of identity (identical twins seem like distinct people, but you-who-has-learned-fact-A seems like the same person as counterfactual-you-who-instead-learned-fact-B, so it seems neither characterization gets it exactly right, yet they could both just claim it’s a quantitative matter and correct measurement would fix it).
But there’s also a profoundly different notion of soul, which I will call “downwards soul”, and which you should probably mentally picture as being like a lightning strike which hits a person’s head. By “downwards soul”, I mean major exogenous factors like ecological niche, close social relationships, formative experiences, or important owned objects which are maintained over time and continually exert their influence on one’s mindset.
Downwards souls are similar to the supernatural notion of souls and unlike the sideways and upwards souls in that they theoretically cannot be duplicated (because they are material rather than informational) and do not really materially exist in the brain but could conceivably reincarnate after death (or even before death) if the conditions that generate them reoccur. It is also possible for hostile powers to displace the downwards soul that exists in a body and put in a different downwards soul; e.g. if a person joins a gang that takes care of them in exchange for them collaborating with antisocial activities.
The reason I call them “sideways”, “upwards” and “downwards” souls is that I imagine the world as a causal network arranged with time going along the x-axis and energy level going along the y-axis. So sideways souls diffuse up and down the energy scale, probably staying roughly constant on average, whereas upwards souls diffuse up the energy scale, from low-energy stuff (inert information stored in e.g. DNA) to high-energy stuff (societal dynamics) and downwards souls diffuse down the energy scale, from high-energy stuff (ecological niches) to low-energy stuff (information stored in e.g. brain synapses).
I’m having trouble following whether this categorizes the definition/concept of a soul, or the causality and content of this conception of soul. Is “sideways soul” about structure and material implementation, or about weights and connectivity, independent of substrate? WHICH factors are removed from upwards (“genes” and “utility function” are VERY different dimensions, both tiny parts of what I expect create (for genes) or comprise (for utility function) a soul)? What about memory? multiple levels of value and preferences (including meta-preferences in how to abstract into “values”)?
Putting “downwards” supernatural ideas into the same framework as more logical/materialist ideas confuses me—I can’t tell if that makes it a more useful model or less.
When you get into the particulars, there are multiple feasible notions of sideways soul, of which material implementation vs weights and connectivity are the main ones. I’m most sympathetic to weights and connectivity.
I have thought less about and seen less discussion about upwards souls. I just mentioned it because I’d seen a brief reference to it once, but I don’t know anything in-depth. I agree that both genes and utility function seem incomplete for humans, though for utility maximizers in general I think there is some merit to the soul == utility function view.
Memory would usually go in sideways soul, I think.
idk
Sideways vs upwards vs downwards is more meant to be a contrast between three qualitatively distinct classes of frameworks than it is meant to be a shared framework.
Excellent! I like the move of calling this “soul” with no reference to metaphysical souls. This is highly relevant to discussions of “free will” if the real topic is self-determination—which it usually is.
“Downwards souls are similar to the supernatural notion of souls” is an explicit reference to metaphysical souls, no?
um, it claims to be :)
I don’t think that’s got much relationship to the common supernatural notion of souls.
But I read it yesterday and forgot that you’d made that reference.
What special characteristics do you associate with the common supernatural notion of souls which differs from what I described?
The word ‘soul’ is so tied in my mind to implausible metaphysical mythologies that I’d parse this better if the word were switched for something like ‘quintessence’ or ‘essential self’ or ‘distinguishing uniqueness’.
What implausible metaphysical mythologies is it tied up with? As mentioned in my comment, downwards souls seem to satisfy multiple characteristics we’d associate with mythological souls, so this and other things makes me wonder if the metaphysical mythologies might actually be more plausible than you realize.
Thesis: one of the biggest alignment obstacles is that we often think of the utility function as being basically-local, e.g. that each region has a goodness score and we’re summing the goodness over all the regions. This basically-guarantees that there is an optimal pattern for a local region, and thus that the global optimum is just a tiling of that local optimal pattern.
Even if one adds a preference for variation, this likely just means that a distribution of patterns is optimal, and the global optimum will be a tiling of samples from said distribution.
The trouble is, to get started it seems like we would need to narrow down the class of functions to have some structure that we can use to get going and make sense of these things. But what would be some general yet still nontrivial structure we could want?
Current agent models like argmax entirely lack any notion of “energy”. Not only does this seem kind of silly on its own, I think it also leads to missing important dynamics related to temperature.
I think I’ve got it, the fix to the problem in my corrigibility thing!
So to recap: It seems to me that for the stop button problem, we want humans to control whether the AI stops or runs freely, which is a causal notion, and so we should use counterfactuals in our utility function to describe it. (Dunno why most people don’t do this.) That is, if we say that the AI’s utility should depend on the counterfactuals related to human behavior, then it will want to observe humans to get input on what to do, rather than manipulate them, because this is the only way for it to be dependent on the counterfactuals.
But So8res pointed out, just using counterfactuals directly is a big problem because it rapidly brings us out of distribution. A practical implementation of this beyond the stop button problem ends up having an exponential number of counterfactuals to cover, and the vast majority of those counterfactuals will be far outside of the real-world distribution. This means that the AI might not get aligned at all, because the real-world applications don’t get nonnegligible weight in the utility function.
But I think I’ve figured out a solution now, which I’d call conditional+counterfactual corrigibility. As usual let’s use B to denote that the stop button gets pressed and the AI shuts down, V to denote whichever non-corrigible utility function we want to make corrigible, and X_s/X_f to denote the value of a quantity X under the counterfactual where people do (s) or do not (f) want to press the stop button. However, we will also use S and F to denote the conditions where people do or do not want to press the stop button. In that case, we can define Control(C) to mean that humans can control whether the AI stops or runs in condition C:
Control(C) = E[B_s | C] + E[V_f | C]
and then we simply want to define the utility as saying that people can control the AI in both the S and the F condition:
U = Control(S) + Control(F)
Previously, I strongly emphasized the need to keep the AI “under a counterfactual”—that is, if it believed it could control whether humans want to stop it or not, then it would be incentivized to manipulate humans. But this is what brings us out of distribution. However, counterfactuals aren’t the only way to keep the appearance of a phenomenon constant—conditionals work too. And conditionals keep you nicely on distribution, so that’s now my solution to the distribution issues. This means that we can use much less invasive counterfactuals.
That said, this approach is going to have a hard time with chaotic phenomena, as combining conditionals and counterfactuals in the presence of chaos can get pretty weird.
It also might be vulnerable to some variant of the critiques that were first raised against it, because now the conditional introduces a link between its policy and the scenarios it faces, but I can’t immediately construct a case where it happens, because the conditional would still somewhat tend to sabotage the obvious approaches. This sort of unclarity is kind of concerning when it comes to the idea.
Like suppose the AI immediately very publicly does something that looks very unsafe. Say grabs control over the stop button and starts mass-producing paperclips in an extremely publicly visible way. This would probably lead to people wanting to stop it. So therefore, if it has a policy like that, the |S conditional would lead to people quickly wanting to stop it. This means that in the |S branch, it can quickly determine whether it is in the f|S branch or the s|S branch; in the f|S case, it can then keep going with whatever optimization V specified, while in the s|S case, it can then immediately shut down itself.
But the reason I think the AI *wouldn’t* do this is, what about the |F branch? If you condition on humans not wanting to press the stop button even though there’s a clearly unaligned AI, what sort of situation could produce this? I have trouble imagining it, because it seems like it would need to be pretty extreme. The best ideas I can come up with is stuff like “black hole swallows the earth”, but this would rank pretty low in the AI’s utility function, and therefore it would avoid acting this way in order to have a reasonable |F branch.
But this does not seem like sane reasoning on the AI’s side to me, so it seems like this should be fixed. And of course, fixed in a principled rather than unprincipled way.
I was surprised to see this on twitter:
I mean, I’m pretty sure I knew what caused it (this thread or this market), and I guess I knew from Zack’s stuff that rationalist cultism had gotten pretty far, but I still hadn’t expected that something this small would lead to being blocked.
FYI: I have a low bar for blocking people who have according-to-me bad, overconfident takes about probability theory, in particular. For whatever reason, I find people making claims about that topic, in particular, really frustrating. ¯\_(ツ)_/¯
The block isn’t meant as a punishment, just a “I get to curate my online experience however I want.”
I think blocks are pretty irrelevant unless one conditions on the particular details of the situation. In this case I think the messages I was sharing are very important. If you think my messages are instead unimportant or outright wrong, then I understand why you would find the block less interesting, but in that case I don’t think we can meaningfully discuss it without knowing why you disagree with the messages.
I’m not particularly interested in discussing it in depth. I’m more like giving you a data-point in favor of not taking the block personally, or particularly reading into it.
(But yeah, “I think these messages are very important”, is likely to trigger my personal “bad, overconfident takes about probability theory” neurosis.)
This is awkwardly armchair, but… my impression of Eliezer includes him being just so tired, both specifically from having sacrificed his present energy in the past while pushing to rectify the path of AI development (by his own model thereof, of course!) and maybe for broader zeitgeist reasons that are hard for me to describe. As a result, I expect him to have entered into the natural pattern of having a very low threshold for handing out blocks on Twitter, both because he’s beset by a large amount of sneering and crankage in his particular position and because the platform easily becomes a sinkhole in cognitive/experiential ways that are hard for me to describe but are greatly intertwined with the aforementioned zeitgeist tiredness.
Something like: when people run heavily out of certain kinds of slack for dealing with The Other, they reach a kind of contextual-but-bleed-prone scarcity-based closed-mindedness of necessity, something that both looks and can become “cultish” but where reaching for that adjective first is misleading about the structure around it. I haven’t succeeded in extracting a more legible model of this, and I bet my perception is still skew to the reality, but I’m pretty sure it reflects something important that one of the major variables I keep in my head around how to interpret people is “how Twitterized they are”, and Eliezer’s current output there fits the pattern pretty well.
I disagree with the sibling thread about this kind of post being “low cost”, BTW; I think adding salience to “who blocked whom” types of considerations can be subtly very costly. The main reason I’m not redacting my own whole comment on those same grounds is that I’ve wound up branching to something that I guess to be more broadly important: there’s dangerously misaligned social software and patterns of interaction right nearby due to how much of The Discussion winds up being on Twitter, and keeping a set of cognitive shielding for effects emanating from that seems prudent.
I agree publicizing blocks has costs, but so does a strong advocate of something with a pattern of blocking critics. People publicly announcing “Bob blocked me” is often the only way to find out if Bob has such a pattern.
I do think it was ridiculous to call this cultish. Tuning out critics can be evidence of several kinds of problems, but not particularly that one.
I agree that it is ridiculous to call this cultish if this was the only evidence, but we’ve got other lines of evidence pointing towards cultishness, so I’m making a claim of attribution more so than a claim of evidence.
Blocking a lot isn’t necessarily bad or unproductive… but in this case it’s practically certain blocking thousands will eventually lead to blocking someone genuinely more correct/competent/intelligent/experienced/etc… than himself, due to sheer probability. (Since even a ‘sneering’ crank is far from literal random noise.)
Which wouldn’t matter at all for someone just messing around for fun, who can just treat X as a text-heavy entertainment system. But it does matter somewhat for anyone trying to do something meaningful and/or accomplish certain goals.
In short, blocking does have some, variable, credibility cost. Ranging from near zero to quite a lot, depending on who the blockee is.
Eliezer Yudkowsky being tired isn’t an unrelated accident though. Bayesian decision theory in general intrinsically causes fatigue by relying on people to use their own actions to move outcomes instead of getting leverage from destiny/higher powers, which matches what you say about him having sacrificed his present energy for this.
Similarly, “being Twitterized” is just about stewing in garbage and cursed information, such that one is forced to filter extremely aggressively, but blocking high-quality information sources accelerates the Twitterization by changing the ratio of blessed to garbage/cursed information.
On the contrary, I think raising salience of such discussions helps clear up the “informational food chain”, allowing us to map out where there are underused opportunities and toxic accumulation.
It seems likely to me that Eliezer blocked you because he has concluded that you are a low-quality information source, no longer worth the effort of engaging with.
I agree that this is likely Eliezer’s mental state. I think this belief is false, but for someone who thinks it’s true, there’s of course no problem here.
Please say more about this. Where can I get some?
Working on writing stuff but it’s not developed enough yet. To begin with you can read my Linear Diffusion of Sparse Lognormals sequence, but it’s not really oriented towards practical applications.
I will look forward to that. I have read the LDSL posts, but I cannot say that I understand them, or guess what the connection might be with destiny and higher powers.
One of the big open questions that the LDSL sequence hasn’t addressed yet is, what starts all the lognormals and why are they so commensurate with each other. So far, the best answer I’ve been able to come up with is a thermodynamic approach (hence my various recent comments about thermodynamics). The lognormals all originate as emanations from the sun, which is obviously a higher power. They then split up and recombine in various complicated ways.
As for destiny: The sun throws in a lot of free energy, which can be developed in various ways, increasing entropy along the way. But some developments don’t work very well, e.g. self-sabotaging (fire), degenerating (parasitism leading to capabilities becoming vestigial), or otherwise getting “stuck”. But it’s not all developments that get stuck, some developments lead to continuous progress (sunlight → cells → eukaryotes → animals → mammals → humans → society → capitalism → ?).
This continuous progress is not just accidental, but rather an intrinsic part of the possibility landscape. For instance, eyes have evolved in parallel to very similar structures, and even modern cameras have a lot in common with eyes. There’s basically some developments that intrinsically unblock lots of derived developments while preferentially unblocking developments that defend themselves over developments that sabotage themselves. Thus as entropy increases, such developments will intrinsically be favored by the universe. That’s destiny.
Critically, getting people to change many small behaviors in accordance with long explanations contradicts destiny because it is all about homogenizing things and adding additional constraints whereas destiny is all about differentiating things and releasing constraints.
Meta-point: your communication fits the following pattern:
The reason why smart people find themselves in this pattern is that they expect short inferential distances, i.e., they see their argumentation not as vague esoteric crackpottery but as a set of very clear statements, and fail to put themselves in the shoes of the people who are going to read it, and they especially fail to account for the fact that readers already distrust them because they started the conversation with <controversial statement>.
On the object level, as stated, you are wrong. Observing a heuristic failing should decrease your confidence in the heuristic. You can argue that your update should be small, due to, say, measurement errors or strong priors, but the direction of the update should be strictly down.
Can you fill in a particular example of me engaging in that pattern so we can address it in the concrete rather than in the abstract?
To be clear, I mean “your communication in this particular thread”.
Pattern:
<controversial statement>
<this statement is false>
<controversial statement>
<this statement is false>
<mix of “this is trivially true because” and “here is my blogpost with esoteric terminology”>
The following responses from EY are more in the genre of “I ain’t reading this”, because he is using you more as an example for other readers than talking directly to you, followed by the block.
This statement had two parts. Part 1:
And part 2:
Part 2 is what Eliezer said was false, but it’s not really central to my point (hence why I didn’t write much about it in the original thread), and so it is self-sabotaging of Eliezer to zoom into this rather than the actually informative point.
I do think if that thread got you blocked then that’s sad (my guess is I think you were more right than Eliezer, though I haven’t read the full sequence that you linked to).
I do think Twitter blocks don’t mean very much. I think it’s approximately zero evidence of “cultism” or whatever. Most people with many followers on Twitter seem to need to have a hair trigger for blocking, or at least feel like they need to, in order to not constantly have terrible experiences.
This is a very useful point:
I think that this is a point that people who are not on social media that much don’t get: you need to be very quick to block, because otherwise you will not have good experiences on the site.
I think our instincts may be misleading here, because the internet works differently from real life.
In real life, not interacting with someone is the default. Unless you have some kind of relationship with someone, people have no obligation to call you or meet you. And if I call someone on the phone just to say “dude, I disagree with your theory”, I would expect that person to hang up… and maybe say “sorry, I’m busy” before hanging up, if they are extra polite. The interactions are mutually agreed, and you have no right to complain when the other party decides to not give you the time. (And if you keep insisting… that’s what the restraining orders are for.)
On the internet, once you sign up to e.g. Twitter, the default is that anyone can talk to you, and if you are not interested in reading the texts they send you, you need to block them. As far as I know, there are no options in the middle between “block” and “don’t block”. (Nothing like “only let them talk to me when it is important” or “only let them talk to me on Tuesdays between 3 PM and 5 PM”.) And if you are a famous person, I guess you need to keep blocking left and right, otherwise you would drown in the text—presumably you don’t want to spend 24 hours a day sifting through Twitter messages, and you want to get the ones you actively want, which requires you to aggressively filter out everything else.
So getting blocked is not an equivalent of getting a restraining order, but more like an equivalent of the other person no longer paying attention to you. Which most people would not interpret as evidence of cultism.
This is the key to understanding why I think it’s more okay to block than a lot of other people think, and the fact that the default is anyone can talk to you means you get way too much crap without blocking lots of people.
I think whether it’s cultism depends on what model one has of how cults work. I don’t know much about it so I might be totally ignorant, but I think a major factor is just engaging in a futile, draining activity powered by popularity, so one needs to carefully preserve resources and maintain appearances.
Huh, I guess you mean cult in a broader “polarization” sense? Like, where are the democratic and republican parties on the cultishness scale in your model?
Idk, my main point of reference is that I recently read Some Desperate Glory, which was about a cult of terrorists. Polarization generally implies a balanced conflict, which isn’t really futile.
I don’t know much about how they work internally. Democracy is a weird system because you’ve got the adversarial thing that would make it less futile, but also the popularity contest thing that would make it more narcissistic and thus more cultish.
This explanation sounds like what they’d say. I think the real reason this is common is more a status thing: it’s a pretty standard strategy for people to try to gain status by “dunking” on tweets by more famous people, and blocking them is the standard countermeasure.
The dunking seems like constant terrible experiences.
The more prominent you are, the more people want to talk with you, and the less time you have to talk with them. You have to shut them out the moment the cost is no longer worth paying.
People should feel free to liberally block one another on social media. Being blocked is not enough to warrant an accusation of cultism.
I did not say that simply blocking me warrants an accusation of cultism. I highlighted the fact that I had been blocked and the context in which it occurred, and then brought up other angles which evidenced cultism. If you think my views are pathetic and aren’t the least bit alarmed by them being blocked, then feel free to feel that way, but I suspect there are at least some people here who’d like to keep track of how the rationalist isolation is progressing and who see merit in my positions.
Again, people block one another on social media for any number of reasons. That just doesn’t warrant feeling alarmed or like your views are pathetic.
We know what the root cause is, you don’t have to act like it’s totally mysterious. So the question is, was this root cause (pushback against Eliezer’s Bayesianism):
An important insight that Eliezer was missing (alarming!)
Worthless pedantry that he might as well block (nbd/pathetic)
Antisocial trolling that ought to be gotten rid of (reassuring that he blocked)
… or something else
Regardless of which of these is the true one, it seems informative to highlight for anyone who is keeping track of what is happening around me. And if the first one is the true one, it seems like people who are keeping track of what is happening around Eliezer would also want to know it.
Especially since it only takes a very brief moment to post and link about getting blocked. Low cost action, potentially high reward.
MIRI full-time employed many critics of bayesianism for 5+ years and MIRI researchers themselves argued most of the points you made in these arguments. It is obviously not the case that critiquing bayesianism is the reason why you got blocked.
Idk, maybe you’ve got a point, but Eliezer was very quick to insist what I said was not the mainstream view and disengage. And MIRI was full of internal distrust. I don’t know enough of the situation to know if this explains it, but it seems plausible to me that the way MIRI kept stuff together was by insisting on a Bayesian approach, and that some generators of internal dissent were people whose intuition aligned more with a non-Bayesian approach.
For that matter, an important split in rationalism is MIRI/CFAR vs the Vassarites, and while I wouldn’t really say the Vassarites formed a major inspiration for LDSL, after coming up with LDSL I’ve totally reevaluated my interpretation of that conflict as being about MIRI/CFAR using a Bayesian approach and the Vassarites using an LDSL approach. (Not absolutely of course, everyone has a mixture of both, but in terms of relative differences.)
I’ve been thinking about how the way to talk about how a neural network works (instead of how it could hypothetically come to work by adding new features) would be to project away components of its activations/weights, but I got stuck because of the issue where you can add new components by subtracting off large irrelevant components.
I’ve also been thinking about deception and its relationship to “natural abstractions”, and in that case it seems to me that our primary hope would be that the concepts we care about are represented at a larger “magnitude” than the deceptive concepts. This is basically using L2-regularized regression to predict the outcome.
It seems potentially fruitful to use something akin to L2 regularization when projecting away components. The most straightforward translation of the regularization would be to analogize the regression coefficient to (f(x) − f(x − u u^T x)) u^T / (u^T x), in which case the L2 term would be ||(f(x) − f(x − u u^T x)) u^T / (u^T x)||^2, which reduces to ||f(x) − f(x − u u^T x)||^2 / ||u^T x||^2.
If f(w) = P_w(o|i) is the probability[1] that a neural network with weights w gives to an output o given a prompt i, then when you’ve actually explained o, it seems like you’d basically have f(w) − f(w − u u^T w) ≈ f(w), or in other words P_{w − u u^T w}(o|i) ≈ 0. Therefore I’d want to keep the regularization coefficient weak enough that I’m in that regime.
In that case, the L2 term would then basically reduce to minimizing 1/||u^T w||^2, or in other words maximizing ||u^T w||^2. Realistically, both this and P_{w − u u^T w}(o|i) ≈ 0 are probably achieved when u = w/||w||, which on the one hand is sensible (“the reason for the network’s output is because of its weights”) but on the other hand is too trivial to be interesting.
In regression, eigendecomposition gives us more gears, because L2-regularized regression basically shrinks the regression coefficient for each principal component by a factor of λ/(λ+α), where λ is the variance of the principal component and α is the regularization coefficient. So one can consider all the principal components ranked by βλ/(λ+α) to get a feel for the gears driving the regression. When α is small, as it is in our regime, this ranking is of course in the same order as the one you get from βλ, the covariance between the PCs and the dependent variable.
This suggests that if we had a change of basis for w, one could obtain a nice ranking of it. Though this is complicated by the fact that f is not a linear function and therefore we have no equivalent of β. To me, this makes it extremely tempting to use the Hessian eigenvectors V as a basis, as this is the thing that at least makes each of the inputs to f “as independent as possible”. Though rather than ranking by the eigenvalues of H_f(w) (which ideally we’d actually prefer to be small rather than large, to stay in the ~linear regime), it seems more sensible to rank by the components of the projection of w onto V (which represent “the extent to which w includes this Hessian component”).
In summary, if H_w P_w(o|i) = V Λ V^T, then we can rank the importance of each component V_j by (P_{w − V_j V_j^T w}(o|i) − P_w(o|i)) · V_j^T w.
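On a toy scale this would look something like the following (a sketch only; torch.autograd.functional.hessian won’t scale to a real network, and prob_fn is a stand-in for the map w ↦ P_w(o|i)):

    import torch
    from torch.autograd.functional import hessian

    def rank_hessian_components(prob_fn, w: torch.Tensor) -> torch.Tensor:
        # Eigendecompose the Hessian of the output probability with respect to the
        # (flattened) weights, then score each eigenvector V_j by
        # (P_{w - V_j V_j^T w}(o|i) - P_w(o|i)) * V_j^T w.
        H = hessian(prob_fn, w)        # (n, n); only feasible for tiny n
        _, V = torch.linalg.eigh(H)    # columns of V are eigenvectors
        base = prob_fn(w)
        scores = []
        for j in range(V.shape[1]):
            v = V[:, j]
            coef = v @ w               # V_j^T w
            scores.append((prob_fn(w - coef * v) - base) * coef)
        return torch.stack(scores)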
Maybe I should touch grass and start experimenting with this now, but there’s still two things that I don’t like:
There’s a sense in which I still don’t like using the Hessian because it seems like it would be incentivized to mix nonexistent mechanisms in the neural network together with existent ones. I’ve considered alternatives like collecting gradient vectors along the training of the neural network and doing something with them, but that seems bulky and very restricted in use.
If we’re doing the whole Hessian thing, then we’re modelling f as quadratic, yet f(x+δx) − f(x) seems like an attribution method that’s more appropriate when modelling f as ~linear. I don’t think I can just switch all the way to quadratic models, because realistically f is more gonna be sigmoidal-quadratic, and for large steps δx, the changes to a sigmoidal-quadratic function are better modelled by f(x+δx) − f(x) than by some quadratic thing. But ideally I’d have something smarter...
[1] Normally one would use log probs, but for reasons I don’t want to go into right now, I’m currently looking at probabilities instead.
Much dumber ideas have turned into excellent papers
True, though I think the Hessian is problematic enough that I’d either want to wait until I have something better, or want to use a simpler method.
It might be worth going into more detail about that. The Hessian for the probability of a neural network output is mostly determined by the Jacobian of the network. But in some cases the Jacobian gives us exactly the opposite of what we want.
If we consider the toy model of a neural network with no input neurons and only 1 output neuron g(w) = ∏_i w_i (which I imagine to represent a path through the network, i.e. a bunch of weights get multiplied along the layers to the end), then the Jacobian is the gradient (J_g(w))_j = (∇g(w))_j = ∏_{i≠j} w_i = (∏_i w_i) / w_j. If we ignore the overall magnitude of this vector and just consider how the contribution that it assigns to each weight varies over the weights, then we get (J_g(w))_j ∝ 1/w_j. Yet for this toy model, “obviously” the contribution of weight j “should” be proportional to w_j.
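A quick numerical check of this with autograd:

    import torch

    # g(w) = prod_i w_i: the gradient component for w_j is prod_{i != j} w_i,
    # i.e. proportional to 1/w_j, so the largest weight gets the smallest attribution.
    w = torch.tensor([1.0, 2.0, 4.0], requires_grad=True)
    g = w.prod()
    g.backward()
    print(w.grad)                 # tensor([8., 4., 2.])
    print(g.item() / w.detach())  # same values: prod(w) / w_j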
So derivative-based methods seem to give the absolutely worst-possible answer in this case, which makes me pessimistic about their ability to meaningfully separate the actual mechanisms of the network (again they may very well work for other things, such as finding ways of changing the network “on the margin” to be nicer).
One thing that seems really important for agency is perception. And one thing that seems really important for perception is representation learning. Where representation learning involves taking a complex universe (or perhaps rather, complex sense-data) and choosing features of that universe that are useful for modelling things.
When the features are linearly related to the observations/state of the universe, I feel like I have a really good grasp of how to think about this. But most of the time, the features will be nonlinearly related; e.g. in order to do image classification, you use deep neural networks, not principal component analysis.
I feel like it’s an interesting question: where does the nonlinearity come from? Many causal relationships seem essentially linear (especially if you do appropriate changes of variables to help, e.g. taking logarithms; for many purposes, monotonicity can substitute for linearity), and lots of variance in sense-data can be captured through linear means, so it’s not obvious why nonlinearity should be so important.
Here’s some ideas I have so far:
Suppose you have a Gaussian mixture distribution with two Gaussians d_1 = N(μ_1, Σ), d_2 = N(μ_2, Σ) with different means and identical covariances. In this case, the function that separates them optimally is linear. However, if the covariances differed between the Gaussians, d_1 = N(μ_1, Σ_1), d_2 = N(μ_2, Σ_2), then the optimal separating function is nonlinear (there’s a small symbolic check of this after this list). So this suggests to me that one reason for nonlinearity is fundamental to perception: nonlinearity is necessary if multiple different processes could be generating the data, and you need to discriminate between the processes themselves. This seems important for something like vision, where you don’t observe the system itself, but instead observe light that bounced off the system.
Consider the notion of the habitable zone of a solar system; it’s the range in which liquid water can exist. Get too close to the star and the water will freeze, get too far and it will boil. Here, it seems like we have two monotonic effects which add up, but because the effects aren’t linear, the result can be nonmonotonic.
Many aspects of the universe are fundamentally nonlinear. But they tend to exist on tiny scales, and those tiny scales tend to mostly get lost to chaotic noise, which tends to turn things linear. However, there are things that don’t get lost to noise, e.g. due to conservation laws; these provide fundamental sources of nonlinearity in the universe.
… and actually, most of the universe is pretty linear? The vast majority of the universe is ~empty space; there isn’t much complex nonlinearity happening there, just waves and particles zipping around. If we disregard the empty space, then I believe (might be wrong) that the vast majority is stars. Obviously lots of stuff is going on within stars, but all of the details get lost to the high energies, so it is mostly simple monotonic relations that are left. It seems that perhaps nonlinearity tends to live on tiny boundaries between linear domains. The main thing that makes these tiny boundaries so relevant, such that we can’t just forget about them and model everything in piecewise linear/piecewise monotonic ways, is that we live in the boundary.
Another major thing: It’s hard to persist information in linear contexts, because it gets lost to noise. Whereas nonlinear systems can have multiple stable configurations and therefore persist it for longer.
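Here’s a quick symbolic check of the first idea above, that equal covariances give a linear separating function while unequal covariances give a quadratic one (a 1-D sketch in sympy):

    import sympy as sp

    x, m1, m2 = sp.symbols('x mu1 mu2', real=True)
    s, s1, s2 = sp.symbols('sigma sigma1 sigma2', positive=True)

    def log_gauss(x, mu, sigma):
        return -sp.log(sigma) - (x - mu)**2 / (2 * sigma**2)

    # Log-likelihood ratio between the two Gaussians of the mixture:
    equal = sp.expand(log_gauss(x, m1, s) - log_gauss(x, m2, s))
    unequal = sp.expand(log_gauss(x, m1, s1) - log_gauss(x, m2, s2))

    print(sp.simplify(equal.coeff(x, 2)))    # 0: the x**2 terms cancel, so the boundary is linear
    print(sp.simplify(unequal.coeff(x, 2)))  # nonzero: the boundary is quadratic in x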
There is of course a lot of nonlinearity in organisms and other optimized systems, but I believe they result from the world containing the various factors listed above? Idk, it’s possible I’ve missed some.
It seems like it would be nice to develop a theory on sources of nonlinearity. This would make it clearer why sometimes selecting features linearly seems to work (e.g. consider IQ tests), and sometimes it doesn’t.
Thesis: money = negative entropy, wealth = heat/bound energy, prices = coldness/inverse temperature, Baumol effect = heat diffusion, arbitrage opportunity = free energy.
Maybe this mainly works because the economy is intelligence-constrained (since intelligence works by pulling off negentropy from free energy), and it will break down shortly after human-level AGI?
Thesis: there’s a condition/trauma that arises from having spent a lot of time in an environment where there’s excess resources for no reasons, which can lead to several outcomes:
Inertial drifting in the direction implied by one’s prior adaptations,
Conformity/adaptation to social popularity contests based on the urges above,
Getting lost in meta-level preparations,
Acting as a stickler for the authorities,
“Bite the hand that feeds you”,
Tracking the resource/motivation flows present.
By contrast, if resources are contingent on a particular reason, everything takes shape according to said reason, and so one cannot make a general characterization of the outcomes.
It’s not clear to me how this results from “excess resources for no reasons”. I guess the “for no reasons” part is crucial here?
Thesis: the median entity in any large group never matters and therefore the median voter doesn’t matter and therefore the median voter theorem proves that democracies get obsessed about stuff that doesn’t matter.
A lot depends on your definition of “matter”. Interesting and important debates are always on margins of disagreement. The median member likely has a TON of important beliefs and activities that are uncontroversial and ignored for most things. Those things matter, and they matter more than 95% of what gets debated and focused on.
The question isn’t whether the entities matter, but whether the highlighted, debated topics matter.
I recently wrote a post about myopia, and one thing I found difficult when writing the post was in really justifying its usefulness. So eventually I mostly gave up, leaving just the point that it can be used for some general analysis (which I still think is true), but without doing any optimality proofs.
But now I’ve been thinking about it further, and I think I’ve realized—don’t we lack formal proofs of the usefulness of myopia in general? Myopia seems to mostly be justified by the observation that we’re already being myopic in some ways, e.g. when training prediction models. But I don’t think anybody has formally proven that training prediction models myopically rather than nonmyopically is a good idea for any purpose?
So that seems like a good first step. But that immediately raises the question, good for what purpose? Generally it’s justified with us not wanting the prediction algorithms to manipulate the real-world distribution of the data to make it more predictable. And that’s sometimes true, but I’m pretty sure one could come up with cases where it would be perfectly fine to do so, e.g. I keep some things organized so that they are easier to find.
It seems to me that it’s about modularity. We want to design the prediction algorithm separately from the agent, so we do the predictions myopically because modifying the real world is the agent’s job. So my current best guess for the optimality criterion of myopic optimization of predictions would be something related to supporting a wide variety of agents.
Yeah, I think usually when people are interested in myopia, it’s because they think there’s some desired solution to the problem that is myopic / local, and they want to try to force the algorithm to find that solution rather than some other one. E.g. answering a question based only on some function of its contents, rather than based on the long-term impact of different answers.
I think that once you postulate such a desired myopic solution and its non-myopic competitors, then you can easily prove that myopia helps. But this still leaves the question of how we know this problems statement is true—if there’s a simpler myopic solution that’s bad, then myopia won’t help (so how can we predict if this is true?) and if there’s a simpler non-myopic solution that’s good, myopia may actively hurt (this one seems a little easier to predict though).
Thesis: a general-purpose interpretability method for utility-maximizing adversarial search is a sufficient and feasible solution to the alignment problem. Simple games like chess have sufficient features/complexity to work as a toy model for developing this, as long as you don’t rely overly much on preexisting human interpretations for the game, but instead build the interpretability from the ground-up.
The universe has many conserved and approximately-conserved quantities, yet among them energy feels “special” to me. Some speculations why:
The sun bombards the earth with a steady stream of free energy, which then leaves out again into the night.
Time-evolution is determined by a 90-degree rotation of energy (Schrodinger equation/Hamiltonian mechanics).
Breaking a system down into smaller components primarily requires energy.
While aspects of thermodynamics could apply to many conserved quantities, we usually apply it to energy only, and it was first discovered in the context of energy.
I guess the standard rationalist-empiricist-reductionist answer would be to say that this is all caused by the second point combined with some sort of space symmetry. I would have agreed until recently, but now it feels circular to me since the reduction into energy relies on our energy-centered way of perceiving the world. So instead I’m wondering if the first point is closer to the core.
Sure, there are plenty of quantities that are globally conserved at the fundamental (QFT) level. But most of these quantities aren’t transferred between objects at the everyday, macro level we humans are used to.
E.g. 1: most everyday objects have neutral electrical charge (because there exist positive and negative charges, which tend to attract and roughly cancel out) so conservation of charge isn’t very useful in day-to-day life.
E.g. 2: conservation of color charge doesn’t really say anything useful about everyday processes, since it’s only changed by subatomic processes (this is again basically due to the screening effect of particles with negative color charge, though the story here is much more subtle, since the main screening effect is due to virtual particles rather than real ones).
The only other fundamental conserved quantity I can think of that is nontrivially exchanged between objects at the macro level is momentum. And… momentum seems roughly as important as energy?
I guess there is a question about why energy, rather than momentum, appears in thermodynamics. If you’re interested, I can answer in a separate comment.
At a human level, the counts for each type of atom are basically always conserved too, so it’s not just a question of why not momentum but also a question of why not moles of hydrogen, moles of carbon, moles of oxygen, moles of nitrogen, moles of silicon, moles of iron, etc.
I guess for momentum in particular, it seems reasonable why it wouldn’t be useful in a thermodynamics-style model because things would woosh away too much (unless you’re dealing with some sort of flow? Idk). A formalization or refutation of this intuition would be somewhat neat, but I would actually more wonder, could one replace the energy-first formulations of quantum mechanics with momentum-first formulations?
> could one replace the energy-first formulations of quantum mechanics with momentum-first formulations?
Momentum is to space what energy is to time. Precisely, energy generates (in the Lie group sense) time-translations, whereas momentum generates spatial translations. So any question about ways in which energy and momentum differ is really a question about how time and space differ.
In ordinary quantum mechanics, time and space are treated very differently: t is a coordinate whereas x is a dynamical variable (which happens to be operator-valued). The equations of QM tell us how x evolves as a function of t.
But ordinary QM was long-ago replaced by quantum field theory, in which time and space are on a much more even footing: they are both coordinates, and the equations of QFT tell us how a third thing (the field ϕ(x,t)) evolves as a function of x and t. Now, the only difference between time and space is that there is only one dimension of the former but three of the latter (there may be some other very subtle differences I’m glossing over here, but I wouldn’t be surprised if they ultimately stem from this one).
All of this is to say: our best theory of how nature works (QFT), is neither formulated as “energy-first” nor as “momentum-first”. Instead, energy and momentum are on fairly equal footing.
I suppose that’s true, but this kind of confirms my intuition that there’s something funky going on here that isn’t accounted for by rationalist-empiricist-reductionism. Like why are time translations so much more important for our general work than space translations? I guess because the sun bombards the earth with a steady stream of free energy, and earth has life which continuously uses this sunlight to stay out of equilibrium. In a lifeless solar system, time-translations just let everything spin, which isn’t that different from space-translations.
Ah, so I think you’re saying “You’ve explained to me the precise reason why energy and momentum (i.e. time and space) are different at the fundamental level, but why does this lead to the differences we observe between energy and momentum (time and space) at the macro-level?”
This is a great question, and as with any question of the form “why does this property emerge from these basic rules”, there’s unlikely to be a short answer. E.g. if you said “given our understanding of the standard model, explain how a cell works”, I’d have to reply “uhh, get out a pen and paper and get ready to churn through equations for several decades”.
In this case, one might be able to point to a few key points that tell the rough story. You’d want to look at properties of solutions of PDEs on manifolds with a metric of signature (1,3) (which means “one direction on the manifold is different to the other three, in that it carries a minus sign in the metric compared to the others”). I imagine that, generically, these solutions behave differently with respect to the “1” direction and the “3” directions. These differences will lead to the rest of the emergent differences between space and time. Sorry I can’t be more specific!
Why assume a reductionistic explanation, rather than a macroscopic explanation? Like for instance the second law of thermodynamics is well-explained by the past hypothesis but not at all explained by churning through mechanistic equations. This seems in some ways to have a similar vibe to the second law.
The best answer to the question is that it serves as essentially a universal resource that can be used to provide a measuring stick.
It does this by being a resource that is limited, fungible, always better to have more of than less of, and additive across decisions:
You have a limited amount of joules of energy/negentropy, but you can spend it on essentially arbitrary goods for your utility, and it is essentially a more physical and usable form of money in an economy.
Also, more energy is always a positive thing, so that means you never are worse off by having more energy, and energy is linear in the sense that if I’ve spent 10 joules on computation, and spent another 10 joules on computation 1 minute later, I’ve spent 20 joules in total.
Cf this post on the measuring stick of utility problem:
https://www.lesswrong.com/posts/73pTioGZKNcfQmvGF/the-measuring-stick-of-utility-problem
Agree that free energy in many ways seems like a good resource to use as a measuring stick. But matter is too readily available, and takes too much energy to make, so in practice you can’t spend energy on matter. So it’s non-obvious why we wouldn’t have a matter-thermodynamics as well as an energy-thermodynamics. I guess especially with oxygen, since it is so reactive.
I guess one limitation with considering a system where oxygen serves an analogous role to sunlight (beyond such systems being intrinsically rare) is that as the oxygen reacts, it takes up elements, and so you cannot have the “used-up” oxygen leave the system again without diminishing the system. Whereas you can have photons leave again. Maybe this is just the fungibility property again, which to some extent seems like the inverse of the “breaking a system down into smaller components primarily requires energy” property (though your statement of fungibility is more general because it also considers kinetic energy).
Thinking further, a key part of it is that temperature has a tendency to mix stuff together, due to the associated microscopic kinetic energy.
Thesis: the problem with LLM interpretability is that LLMs cannot do very much, so for almost all purposes “prompt X ⇒ outcome Y” is all the interpretation we can get.
Counterthesis: LLMs are fiddly and usually it would be nice to understand what ways one can change prompts to improve their effectiveness.
Synthesis: LLM interpretability needs to start with some application (e.g. a customer support chatbot), so that the external subject matter that actually drives the LLM’s effectiveness gets pulled into the study.
Problem: this seems difficult to access, and the people who have access to it are busy doing their job.
I’m very confused. Can we not do LLM interpretability to try to figure out whether or where superposition holds? Is it not useful to see how SAEs help us identify and intervene on specific internal representations that LLMs generate for real-world concepts?
As an outsider to interpretability, it has long been my (rough) understanding that most of the useful work in interpretability deals precisely with attempts to figure out what is going on inside the model rather than how it responds to outside prompts. So I don’t know what the thesis statement refers to...
I guess to clarify:
Everything has an insanely large amount of information. To interpret something, we need to be able to see what “energy” (definitely literal energy, but likely also metaphorical energy) that information relates to, as the energy is more bounded and unified than the information.
But that’s (the thesis goes) hard for LLMs.
Not really, because this requires some notion of the same vs distinct features, which is not so interesting when the use of LLMs is so brief.
I don’t think so since you’ve often got more direct ways of intervening (e.g. applying gradient updates).
I’m sorry, but I still don’t really understand what you mean here. The phrase “the use of LLMs is so brief” is ambiguous to me. Do you mean to say:
a new, better LLM will come out soon anyway, making your work on current LLMs obsolete?
LLM context windows are really small, so you “use” them only for a brief time?
the entire LLM paradigm will be replaced by something else soon?
something totally different from all of the above?
But isn’t this rather… prosaic and “mundane”?
I thought the idea behind these methods that I have linked was to serve as the building blocks for future work on ontology identification and ultimately getting a clearer picture of what is going on internally, which is a crucial part of stuff like Wentworth’s “Retarget the Search” and other research directions like it.
So the fact that SAE-based updates of the model do not currently result in more impressive outputs than basic fine-tuning does not matter as much compared to the fact that they work at all, which gives us reason to believe that we might be able to scale them up to useful, strong-interpretability levels. Or at the very least that the insights we get from them could help in future efforts to obtain this.
Kind of like how you can teach a dog to sit pretty well just by basic reinforcement, but if you actually had a gears-level understanding of how its brain worked, down to the minute details, and the ability to directly modify the circuits in its mind that represented the concept of “sitting”, then you would be able to do this much more quickly, efficiently, and robustly.
Am I totally off-base here?
Maybe it helps if I start by giving some different applications one might want to use artificial agency for:
As a map: We might want to use the LLM as a map of the world, for instance by prompting it with data from the world and having it assist us in navigating that data. Now, the purpose of a map is to reflect as little information as possible about the world while still providing the bare minimum backbone needed to navigate the world.
This doesn’t work well with LLMs because they are instead trained to model information, so they will carry as much information as possible, and any map-making they do will be an accident driven by mimicking the information they’ve seen from mapmakers, rather than primarily an attempt to eliminate information about the world.
As a controller: We might want to use the LLM to perform small pushes to a chaotic system at times when the system reaches bifurcations where its state is extremely sensitive, such that the system moves in a desirable direction. But again I think LLMs are so busy copying information around that they don’t notice such sensitivities except by accident.
As a coder: Since LLMs are so busy outputting information instead of manipulating “energy”, maybe we could hope that they could assemble a big pile of information that we could “energize” in a relevant way, e.g. if they could write a large codebase and we could then execute it on a CPU and have a program that does something interesting in the world. But in order for this to work, the program shouldn’t have obstacles that stop the “energy” dead in its tracks (e.g. bugs that cause it to crash). But again the LLM isn’t optimizing for doing that, it’s just trying to copy information around that looks like software, and it only makes space for the energy of the CPU and the program functionality as a side-effect of that. (Or as the old saying goes, it’s maximizing lines of code written, not minimizing lines of code used.)
So, that gives us the thesis: To interpret the LLMs, we’d want to build a map of how they connect to the energy in the world, but they really don’t connect very well, so there’s not much to build a map of. The only thing you could really point out is the (input, output) relationships, but once you’ve characterized concrete (input, output) pairs, there’s not really much more of interest to say.
Perhaps both the first and the second, but especially the second: As described above, we might hope you could use them extensively and recursively to build up a big thing, because then for interpretability you could study how to manipulate the contours of that big thing. But that doesn’t really work. So people only use them briefly, rather than extensively.
Retargeting the search is only interesting if the search is able to do big stuff in the world, which LLMs according to the thesis are not.
Thesis: linear diffusion of sparse lognormals contains the explanation for shard-like phenomena in neural networks. The world itself consists of ~discrete, big phenomena. Gradient descent allows those phenomena to make imprints upon the neural networks, and those imprints are what is meant by “shards”.
… But shard theory is still kind of broken because it lacks consideration of the possibility that the neural network might have an impetus to nudge those shards towards specific outcomes.
Thesis: the openness-conscientiousness axis of personality is about whether you live as a result of intelligence or whether you live through a bias for vitality.
In the big five trait model of personality, those are two different axes. Openness is inventive/curious vs consistent/cautious, and conscientiousness is efficient/organized vs extravagant/careless.
I don’t see your comparison (focus on intelligence vs vitality) as single-axis either—they may be somewhat correlated, but not very closely.
I’m not sure I understand the model well enough to look for evidence for or against. But it doesn’t resonate as true enough to be useful.
Big Five is identified by taking the top 5 principal components among different descriptors of people, and then rotating them to be more aligned with the descriptors. Unless one strongly favors the alignment-with-descriptors as a natural criterion, this means that it is as valid to consider any linear combination of the traits as it is to consider the original traits.
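To illustrate the rotation point, here is a minimal numpy sketch (my own toy example with made-up data, not the actual psychometric pipeline): the top five principal components of a person-by-descriptor matrix span a subspace, and any orthogonal rotation of those components fits the data exactly as well, so the data alone doesn’t single out one set of trait axes.

```python
# Toy sketch: top-5 principal components of a hypothetical ratings matrix,
# plus a random orthogonal rotation of them, reconstruct the data equally well.
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.normal(size=(1000, 100))          # hypothetical: 1000 people x 100 descriptors
ratings -= ratings.mean(axis=0)                 # center each descriptor

U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
components = Vt[:5]                             # 5 x 100 loading matrix ("traits")

rot, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal 5x5 rotation
rotated = rot @ components                      # an equally valid set of 5 axes

scores = ratings @ components.T                 # trait scores in the original basis
rotated_scores = ratings @ rotated.T            # scores in the rotated basis

# Reconstruction error from the 5-dimensional subspace is identical either way.
err_orig = np.linalg.norm(ratings - scores @ components)
err_rot = np.linalg.norm(ratings - rotated_scores @ rotated)
print(err_orig, err_rot)                        # equal up to floating point error
```

The real derivation then picks a particular rotation (the one aligned with descriptors); the sketch only shows that the data itself doesn’t force that choice.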
Mostly life needs to be focused on vitality to survive. The ability to focus on intelligence is sort of a weird artifact due to massive scarcity of intelligence, making people throw lots of resources at getting intelligence to their place. This wealth of resources allows intellectuals to sort of just stumble around without being biased towards vitality.
Interesting, thank you for the explanation. I’m not sure I understand (or accept, maybe) the dichotomy between intelligence vs vitality—they seem complementary to me. But I appreciate the discussion.
There’s also an openness+conscientiousness axis, which is closely related to concepts like “competence”.
So in the original text, you meant “openness minus conscientiousness”? That was not clear to me at all; a hyphen-minus looks much more like a hyphen in that position. A true minus sign (−) would have been noticeable to me; using the entire word would have been even more obvious.
Fair
Thesis: if being loud and honest about what you think about others would make you get seen as a jerk, that’s a you problem. It means you either haven’t learned to appreciate others or haven’t learned to meet people well.
I think this is more general: if you’re seen as a jerk, you haven’t learned how to interact with people (at least the subset that sees you as a jerk).
Being loud and honest about your opinions (though really, “honest” is often a cover for “cherry-picked highlights that aren’t quite wrong, but are not honest full evaluations”) is one way to be a jerk, but by no means the only one.
Basically my model is that being silent and dishonest is a way to cover up one’s lack of appreciation for others. Because being loud and honest isn’t being a jerk if your loud honest opinions are “I love and respect you”.
Thought: couldn’t you make a lossless SAE using something along the lines of:
Represent the parameters of the SAE as simply a set of unit vectors for the feature directions.
To encode a vector using the SAE, iterate: find the most aligned feature vector, dot them to get the coefficient for that feature vector, and subtract off the scaled feature vector to get a residual to encode further
With plenty of diverse vectors, this should presumably guarantee excellent reconstruction, so the main issue is to ensure high sparsity, which could be achieved by some sparsity loss (which could probably be more aggressive than L1, due to the better reconstruction?). Main issue I see is the encoding method being inefficient. Dunno if one can optimize my proposal to be fast enough.
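For concreteness, here is a minimal sketch of the encoding loop described above (essentially matching pursuit); the function name, the fixed step budget, and the tolerance are my own choices rather than part of the proposal:

```python
# Greedy encoder: repeatedly pick the most aligned unit feature vector,
# record its coefficient, and subtract it off to get a residual to encode further.
import numpy as np

def greedy_encode(x, features, n_steps=32, tol=1e-6):
    """features: (n_features, d) array, each row unit-norm.
    Returns a sparse coefficient vector and the final residual."""
    coeffs = np.zeros(features.shape[0])
    residual = x.astype(float).copy()
    for _ in range(n_steps):
        alignments = features @ residual           # dot with every feature direction
        i = np.argmax(np.abs(alignments))          # most aligned feature
        c = alignments[i]                          # coefficient for that feature
        if abs(c) < tol:                           # nothing left worth explaining
            break
        coeffs[i] += c
        residual -= c * features[i]                # encode the remainder further
    return coeffs, residual

# Decoding is just coeffs @ features; the residual norm bounds the reconstruction error.
```

The sparsity loss mentioned above would then act on `coeffs`, with the iteration count being the main runtime cost to worry about.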
Idea: for a self-attention where you give it two prompts p1 and p2, could you measure the mutual information between the prompts using something vaguely along the lines of V1^T softmax(K1 K2^T/sqrt(dK)) V2?
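A rough sketch of what computing that quantity might look like, with made-up shapes; note the expression yields a d_v × d_v matrix, so collapsing it to a single score (trace? norm?) is an extra choice not specified above:

```python
# K1, V1 come from running the model on prompt p1; K2, V2 from prompt p2.
# K1: (n1, d_k), K2: (n2, d_k), V1: (n1, d_v), V2: (n2, d_v).
import numpy as np

def cross_prompt_score(K1, V1, K2, V2):
    d_k = K1.shape[-1]
    logits = K1 @ K2.T / np.sqrt(d_k)              # (n1, n2) cross-prompt attention logits
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over p2 positions
    return V1.T @ weights @ V2                     # (d_v, d_v) interaction matrix
```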
In the context of natural impact regularization, it would be interesting to try to explore some @TurnTrout-style powerseeking theorems for subagents. (Yes, I know he denounces the powerseeking theorems, but I still like them.)
Specifically, consider this setup: Agent U starts a number of subagents S1, S2, S3, …, with the subagents being picked according to U’s utility function (or decision algorithm or whatever). Now, would S1 seek power? My intuition says, often not! If S1 seeks power in a way that takes away power from S2, that could disadvantage U. So basically S1 would only seek power in cases where it expects to make better use of the power than S2, S3, ….
Obviously this may be kind of hard for us to make use of if we are trying to make an AI and we only know how to make dangerous utility maximizers. But if we’re happy with the kind of maximizers we can make on the first order (as seems to apply to the SOTA, since current methods aren’t really utility maximizers) and mainly worried about the mesaoptimizers they might make, this sort of theorem would suggest that the mesaoptimizers would prefer staying nice and bounded.
Theory for a capabilities advance that is going to occur soon:
OpenAI is currently getting lots of novel triplets (S, U, A), where S is a system prompt, U is a user prompt, and A is an assistant answer.
Given a bunch of such triplets (S, U_1, A_1), … (S, U_n, A_n), it seems like they could probably create a model P(S|U_1, A_1, …, U_n, A_n), which could essentially “generate/distill prompts from examples”.
This seems like the first step towards efficiently integrating information from lots of places. (Well, they could of course also do standard SGD-based training, but it has its issues.)
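As a toy illustration of the data layout this would need (the tags and field names here are invented, and this is obviously not how any lab actually formats its data), existing (S, U, A) triplets sharing a system prompt could be regrouped into training pairs like this:

```python
# Build one training example for a hypothetical P(S | U_1, A_1, ..., U_n, A_n) model:
# the dialogues become the input context, the shared system prompt becomes the target.
def make_distillation_example(system_prompt, dialogues):
    """dialogues: list of (user_prompt, assistant_answer) pairs sharing one system prompt."""
    context = "\n".join(
        f"<user>{u}</user>\n<assistant>{a}</assistant>" for u, a in dialogues
    )
    return {"input": context, "target": system_prompt}  # train the model to recover S
```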
A followup option: they could use something à la Constitutional AI to generate perturbations A’_1, …, A’_n. If they have a previous model like the above, they could then generate a perturbed prompt S’ from P(S’|U_1, A’_1, …, U_n, A’_n). I consider this significant because it then gives them the training data to create a model P(S’|S, U_1, A_1, A’_1), which essentially allows them to do “linguistic backchaining”: the user can update an output of the network A_1 → A’_1, and the model can then suggest a way to change the prompt to obtain similar updates in the future.
Furthermore I imagine this could get combined together into some sort of “linguistic backpropagation” by repeatedly applying models like this, which could unleash a lot of methods to a far greater extent than they have been so far.
Obviously this is just a very rough sketch, and it would be a huge engineering and research project to get this working in practice. Plus maybe there are other methods that work better. I’m mainly just playing around with this because I think there’s a strong economic pressure for something-like-this, and I want a toy model to use for thinking about its requirements and consequences.
Actually I suppose they don’t even need to add perturbations to A directly, they can just add perturbations to S and generate A’s from S’. Or probably even look at users’ histories to find direct perturbations to either S or A.
I recently wrote a post presenting a step towards corrigibility using causality here. I’ve got several ideas in the works for how to improve it, but I’m not sure which one is going to be most interesting to people. Here’s a list.
Develop the stop button solution further, cleaning up errors, better matching the purpose, etc.
Better explaining the intuitions behind why counterfactuals (and in particular counterfactuals over human preferences) are important for corrigibility.
Using counterfactuals to control a paperclip maximizer to be safe and productive
Attempting to formally prove that counterfactuals work and/or are necessary, perhaps with a TurnTrout-style argument
Thesis: The motion of the planets is the strongest governing factor for life on Earth.
Reasoning: Time-series data often shows strong changes with the day and night cycle, and sometimes also with the seasons. The daily cycle and the seasonal cycle are governed by the relationship between the Earth and the sun. The Earth is a planet, and so its movement is part of the motion of the planets.
I don’t think anybody would have a problem with the statement “The motion of the planet is the strongest governing factor for life on Earth”. It’s when you make it explicitly plural that there’s a problem.
To some extent true, but consider the analogy to a thesis like “Quantum chromodynamics is the strongest governing factor for life on Earth.” Is this sentence also problematic because it addresses locations and energy levels that have no relevance for Earth?
If you replace it with “quantum chromodynamics”, then it’s still very problematic but for different reasons.
Firstly, there’s no obvious narrowing to equally causal factors (“motion of the planet” vs “motion of the planets”) as there is in the original statement. In the original statement, the use of plural instead of singular covers a much broader swath of hypothesis space, and implies that you haven’t ruled out enough to limit it to the singular. So you’re communicating that you think there is significant credence that the motion of more than one planet has a very strong influence on life on Earth.
Secondly, the QCD statement is overly narrow in the stated consequent instead of overly broad in the antecedent: any significant change in quantum chromodynamics would affect essentially everything in the universe, not just life on Earth. “Motion of the planet … life on Earth” is appropriately scoped in both sides of the relation. In the absence of a context limiting the scope to just life on Earth, yes that would be weird and misleading.
Thirdly, it’s generally wrong. The processes of life (and everything else based on chemistry) in physical models depend very much more strongly on the details of the electromagnetic interaction than on any of the details of the colour force. If some other model produced nuclei of the same charges and similar masses, life could proceed essentially unchanged.
However, there are some contexts in which it might be less problematic. In the context of evaluating the possibility of anything similar to our familiar life under alternative physical constants, perhaps.
In a space of universes which are described by the same models as our best current ones but with different values of “free” parameters, it seems that some parameters of QCD may be the most sensitive in terms of whether life like ours could arise—mostly by mediating whether stars can form and have sufficient lifetime. So in that context, it may be a reasonable thing to say. But in most contexts, I’d say it was at best misleading.
Are there good versions of DAGs for other things than causality?
I’ve found Pearl-style causal DAGs (and other causal graphical models) useful for reasoning about causality. It’s a nice way to abstractly talk and think about it without needing to get bogged down with fiddly details.
In a way, causality describes the paths through which information can “flow”. But information is not the only thing in the universe that gets transferred from node to node; there’s also things like energy, money, etc., which have somewhat different properties but intuitively seem like they could benefit from graph-based models too.
I’m pretty sure I’ve seen a number of different graph-based models for describing different flows like this, but I don’t know their names, and also the ones I’ve seen seemed highly specialized and I’m not sure they’re the best to use. But I thought, it seems quite probable that someone on LessWrong would know of a recommended system to learn.
Linear diffusion of sparse lognormals
Think about it
I have a concept that I expect to take off in reinforcement learning. I don’t have time to test it right now, though hopefully I’d find time later. Until then, I want to put it out here, either as inspiration for others, or as a “called it”/prediction, or as a way to hear critique/about similar projects others might have made:
Reinforcement learning is currently trying to do stuff like learning to model the sum of future rewards, e.g. expectations using V, A and Q functions in many algorithms, or the entire probability distribution in algorithms like DreamerV3.
Mechanistically, the reason these methods work is that they stitch together experience from different trajectories. So e.g. if one trajectory goes A → B → C and earns a reward at the end, it learns that states A and B and C are valuable. If another trajectory goes D → A → E → F and gets punished at the end, it learns that E and F are low-value but D and A are high-value because its experience from the first trajectory shows that it could’ve just gone D → A → B → C instead.
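A tiny tabular sketch of this stitching effect (the environment, rewards, and hyperparameters are made up): after replaying the two trajectories above, a Q-learner rates D → A and A → B highly but A → E poorly, because the max over actions at A routes the credit through B → C.

```python
# Actions here are "move to the next state"; reward arrives on the final transition.
from collections import defaultdict

Q = defaultdict(lambda: defaultdict(float))
gamma, lr = 0.9, 0.5

def update(trajectory, final_reward):
    for i in range(len(trajectory) - 1):
        s, s_next = trajectory[i], trajectory[i + 1]
        r = final_reward if i == len(trajectory) - 2 else 0.0
        best_next = max(Q[s_next].values(), default=0.0)
        Q[s][s_next] += lr * (r + gamma * best_next - Q[s][s_next])

for _ in range(200):                       # replay both trajectories until values propagate
    update(["A", "B", "C"], +1.0)
    update(["D", "A", "E", "F"], -1.0)

print(Q["D"]["A"], Q["A"]["B"], Q["A"]["E"])   # D->A and A->B end up high, A->E low
```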
But what if it learns of a path E → B? Or a shortcut A → C? Or a path F → G that gives a huge amount of reward? Because these techniques work by chaining the reward backwards step-by-step, it seems like this would be hard to learn well; for instance, the old value estimates will still approximately satisfy the Bellman equation, so the implications of the new path only propagate backwards slowly.
Ok, so that’s the problem, but how could it be fixed? Speculation time:
You want to learn an embedding of the opportunities you have in a given state (or for a given state-action), rather than just its potential rewards. Rewards are too sparse of a signal.
More formally, let’s say instead of the Q function, we consider what I would call the Hope function: which given a state-action pair (s, a), gives you a distribution over states it expects to visit, weighted by the rewards it will get. This can still be phrased using the Bellman equation:
Hope(s, a) = r·δ_{s’} + f·Hope(s’, a’)
Where s’ is the resulting state that experience has shown comes after s when doing a, r is the reward received there, δ_{s’} is the point distribution on s’ (so the first term is the reward-weighted indicator of s’), f is the discounting factor, and a’ is the optimal action in s’.
Because the Hope function is multidimensional, the learning signal is much richer, and one should therefore maybe expect its internal activations to be richer and more flexible in the face of new experience.
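Here is a minimal tabular sketch of what learning such a Hope function could look like, under assumptions the text doesn’t pin down: discrete states, the reward-weighted distribution stored as a vector with r placed at the successor state’s index, and “optimal action” at s’ taken to mean the one whose Hope vector has the largest total mass (which recovers something Q-like). It’s an illustration, not a tested algorithm.

```python
# Tabular "Hope" function: Hope(s, a) is a vector over states, updated TD-style.
import numpy as np

n_states, n_actions = 10, 4
f, lr = 0.9, 0.1                                  # f is the discount factor, as above
hope = np.zeros((n_states, n_actions, n_states))  # Hope(s, a) -> vector over states

def hope_update(s, a, r, s_next):
    # a' = action at s_next with the largest total reward-weighted Hope mass (assumption)
    a_next = np.argmax(hope[s_next].sum(axis=-1))
    target = np.zeros(n_states)
    target[s_next] = r                            # the r * delta_{s'} term
    target += f * hope[s_next, a_next]            # plus discounted Hope from s'
    hope[s, a] += lr * (target - hope[s, a])      # TD-style move toward the target
```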
Here’s another thing to notice: let’s say for the policy, we use the Hope function as a target to feed into a decision transformer. We now have a natural parameterization for the policy, based on which Hope it pursues.
In particular, we could define another function, maybe called the Result function, which in addition to s and a takes a target distribution w as a parameter, subject to the Bellman equation:
Result(s, a, w) = r·δ_{s’} + f·Result(s’, a’, (w − r·δ_{s’})/f)
Where a’ is the action recommended by the decision transformer when asked to achieve (w − r·δ_{s’})/f from state s’.
This Result function ought to be invariant under many changes in policy, which should make it more stable to learn, boosting capabilities. Furthermore it seems like a win for interpretability and alignment as it gives greater feedback on how the AI intends to earn rewards, and better ability to control those rewards.
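And a similarly rough sketch of the Result recursion, here just unrolled as a finite-depth rollout rather than learned; `policy(s, w)` stands in for the decision transformer queried with a target distribution, and `env_step` for the environment, both placeholders of mine:

```python
# Monte-Carlo unrolling of Result(s, a, w) = r*delta_{s'} + f*Result(s', a', (w - r*delta_{s'})/f)
import numpy as np

def result(s, a, w, policy, env_step, f=0.9, depth=10, n_states=10):
    """Reward-weighted state distribution obtained when starting in s with action a
    and then repeatedly asking the policy to pursue whatever target remains."""
    if depth == 0:
        return np.zeros(n_states)
    s_next, r = env_step(s, a)
    achieved = np.zeros(n_states)
    achieved[s_next] = r                          # the r * delta_{s'} term
    remaining = (w - achieved) / f                # what is left to ask for from s'
    a_next = policy(s_next, remaining)            # decision-transformer-like query
    return achieved + f * result(s_next, a_next, remaining, policy, env_step, f, depth - 1, n_states)
```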
An obvious challenge with this proposal is that states are really latent variables and also too complex to learn distributions over. While this is true, that seems like an orthogonal problem to solve.
Also this mindset seems to pave the way for other approaches, e.g. you could maybe have a Halfway function that factors an ambitious hope into smaller ones or something. Though it’s a bit tricky because one needs to distinguish correlation and causation.
Downvoted because conditional on this being true, it is harmful to publish. Don’t take it personally, but this is content I don’t want to see on LW.
Why harmful?
Because it’s capability research. It shortens the TAI timeline with little compensating benefit.
It’s capability research that is coupled to alignment:
Coupling alignment to capabilities is basically what we need to survive, because the danger of capabilities comes from the fact that capabilities is self-funding, thereby risking outracing alignment. If alignment can absorb enough success from capabilities, we survive.
I missed that paragraph on first reading, mea culpa. I think that your story about how it’s a win for interpretability and alignment is very unconvincing, but I don’t feel like hashing it out atm. Revised to weak downvote.
Also, if you expect this to take off, then by your own admission you are mostly accelerating the current trajectory (which I consider mostly doomed) rather than changing it. Unless you expect it to take off mostly thanks to you?
Surely your expectation that the current trajectory is mostly doomed depends on your expectation of the technical details of the extension of the current trajectory. If technical specifics emerge that shows the current trajectory to be going in a more alignable direction, it may be fine to accelerate.
Sure, if after updating on your discovery, it seems that the current trajectory is not doomed, it might imply accelerating is good. But, here it is very far from being the case.
The “successor representation” is somewhat close to this. It encodes the distribution over future states a particular policy expects to visit from a particular starting state, and can be learned via the Bellman equation / TD learning.
Yes, my instant thought too was “this sounds like a variant on a successor function”.
Of course, the real answer is that if you are worried about the slowness of bootstrapping back value estimates or short eligibility traces, this mostly just shows the fundamental problem with model-free RL and why you want to use models: models don’t need any environmental transitions to solve the use case presented:
If the MBRL agent has learned a good reward-sensitive model of the environmental dynamics, then it will have already figured out E->B and so on, or could do so offline by planning; or if it had not because it is still learning the environment model, it would have a prior probability over the possibility that E->B gives a huge amount of reward, and it can calculate a VoI and target E->B in the next episode for exploration, and on observing the huge reward, update the model, replan, and so immediately begin taking E->B actions within that episode and all future episodes, and benefiting from generalization because it can also update the model everywhere for all E->B-like paths and all similar paths (which might now suddenly have much higher VoI and be worth targeting for further exploration) rather than simply those specific states’ value-estimates, and so on.
(And this is one of the justifications for successor representations: it pulls model-free agents a bit towards model-based-like behavior.)
With MBRL, don’t you end up with the same problem, but when planning in the model instead? E.g. DreamerV3 still learns a value function in their actor-critic reinforcement learning that occurs “in the model”. This value function still needs to chain the estimates backwards.
It’s the ‘same problem’, maybe, but it’s a lot easier to solve when you have an explicit model! You have something you can plan over, don’t need to interact with an environment out in the real world, and can do things like tree search or differentiating through the environmental dynamics model to do gradient ascent on the action-inputs to maximize the reward (while holding the model fixed). Same as training the neural network, once it’s differentiable—backprop can ‘chain the estimates backwards’ so efficiently you barely even think about it anymore. (It just holds the input and output fixed while updating the model.) Or distilling a tree search into a NN—the tree search needed to do backwards induction of updated estimates from all the terminal nodes all the way up to the root where the next action is chosen, but that’s very fast and explicit and can be distilled down into a NN forward pass.
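For what it’s worth, here is a toy sketch (mine, not from the comment above) of the “gradient ascent on the action-inputs while holding the model fixed” idea: a frozen differentiable dynamics model mapping (state, action) to (next state, reward), with a short open-loop action sequence optimized by backprop. The `model` callable and all shapes are placeholders.

```python
# Plan by differentiating through a learned dynamics model: only the actions are
# in the optimizer, so the model's weights are never updated.
import torch

def plan_actions(model, s0, horizon=16, steps=100, lr=0.1, action_dim=4):
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        s, total_reward = s0, 0.0
        for t in range(horizon):
            s, r = model(s, actions[t])           # differentiable (state, action) -> (state, reward)
            total_reward = total_reward + r
        (-total_reward).backward()                # gradient ascent on the actions only
        opt.step()
    return actions.detach()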
And aside from being able to update within-episode or take actions entirely unobserved before, when you do MBRL, you get to do it at arbitrary scale (thus potentially extremely little wallclock time like an AlphaZero), offline (no environment interactions), potentially highly sample-efficient (if the dataset is adequate or one can do optimal experimentation to acquire the most useful data, like PILCO), with transfer learning to all other problems in related environments (because value functions are mostly worthless outside the exact setting, which is why model-free DRL agents are notorious for overfitting and having zero-transfer), easily eliciting meta-learning and zero-shot capabilities, etc.*
* Why yes, all of this does sound a lot like how you train a LLM today and what it is able to do, how curious
I don’t think this is true in general. Unrolling an episode for more steps takes more resources, and the later steps in the episode become more chaotic. DreamerV3 only unrolls for 16 steps.
But when you distill a tree search, you basically learn value estimates, i.e. something similar to a Q function (realistically, V function). Thus, here you also have an opportunity to bubble up some additional information.
I’m not doubting the relevance of MBRL, I expect that to take off too. What I’m doubting is that future agents will be controlled using scalar utilities/rewards/etc. rather than something more nuanced.
Those are two different things. The unrolling of the episode is still very cheap. It’s a lot cheaper to unroll a DreamerV3 for 16 steps than it is to go out into the world and run a robot in a real-world task for 16 steps and try to get the NN to propagate updated value estimates the entire way… (Given how small a Dreamer is, it may even be computationally cheaper to do some gradient ascent on it than it is to run whatever simulated environment you might be using! Especially given simulated environments will increasingly be large generative models, which incorporate lots of reward-irrelevant stuff.) The usefulness of the planning is a different thing, and might also be true for other planning methods in that environment too—if the environment is difficult, a tree search with a very small planning budget like just a few rollouts is probably going to have quite noisy choices/estimates too. No free lunches.
This is again doing the same thing as ‘the same problem’; yes, you are learning value estimates, but you are doing so better than alternatives, and better is better. The AlphaGo network loses to the AlphaZero network, and the latter, in addition to just being quantitatively much better, also seems to have qualitatively different behavior, like fixing the ‘delusions’ (cf. AlphaStar).
They won’t be controlled by something as simple as a single fixed reward function, I think we can agree on that. But I don’t find successor-function like representations to be too promising as a direction for how to generalize agents, or, in fact, any attempt to fancily hand-engineer in these sorts of approaches into DRL agents.
These things should be learned. For example, leaning into Decision Transformers and using a lot more conditionalizing through metadata and relying on meta-learning seems much more promising. (When it comes to generative models, if conditioning isn’t solving your problems, you’re just not using enough conditioning or generative modeling.) A prompt can describe agents and reward functions and the base agent executes that, and whatever is useful about successor-like representations just emerges automatically internally as the solution to the overall family of tasks in turning histories into actions.
But I’m not advocating against MBRL, so this isn’t the relevant counterfactual. A pure MBRL-based approach would update the value function to match the rollouts, but e.g. DreamerV3 also uses the value function in a Bellman-like manner to e.g. impute the future reward at the end of an episode. This allows it to plan for further than the 16 steps it rolls out, but it would be computationally intractable to roll out for as far as this ends up planning.
It’s possible for there to be a kind of chaos where the analytic gradients blow up yet discrete differences have predictable effects. Bifurcations etc..
I agree with things needing to be learned; using the actual states themselves was more of a toy model (because we have mathematical models for MDPs but we don’t have mathematical models for “capabilities researchers will find something that can be Learned”), and I’d expect something else to happen. If I was to run off to implement this now, I’d be using learned embeddings of states, rather than states themselves. Though of course even learned embeddings have their problems.
The trouble with just saying “let’s use decision transformers” is twofold. First, we still need to actually define the feedback system. One option is to just define reward as the feedback, but as you mention, that’s not nuanced enough. You could use some system that’s trained to mimic human labels as the ground truth, but this kind of system has flaws for standard alignment reasons.
It seems to me that capabilities researchers are eventually going to find some clever feedback system to use. It will to a great extent be learned, but they’re going to need to figure out the learning method too.
Thanks for the link! It does look somewhat relevant.
But I think the weighting by reward (or other significant variables) is pretty important, since it generates a goal to pursue, making it emphasize things that can be achieved rather than just things that might randomly happen.
Though this makes me think about whether there are natural variables in the state space that could be weighted by, without using reward per se. E.g. the size of (s’ - s) in some natural embedding, or the variance in s’ over all the possible actions that could be taken. Hmm. 🤔
Thesis: Intellectuals are naturally attracted to poop because it is causally downstream of everything and so ends up having unusually high mutual information with everything, despite not causally influencing much (compared to e.g. food).