Hi, I am a Physicist, an Effective Altruist and AI Safety researcher.
Linda Linsefors
Why do capabilities emerge suddenly after long plateaus?
Is it true in general that capabilities emerge suddenly after long plateaus? My understanding is that grokking only happens if you set up the learning conditions just right.
In the modular addition grokking experiment, there was a general solution that the NN could learn, and there was also a memorize-each-fact solution that it could learn. If you give it enough training data it will just learn the general solution directly. If you give it too little training data it will just memorize. But if you give it something in between, you get a learning trajectory where it first memorizes and then switches.
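As a rough illustration of the setup I have in mind, here is a minimal Python sketch (my own reconstruction, not the original experiment's code; the modulus and the train-fraction regimes are made-up numbers for illustration):

```python
# Toy reconstruction of the modular-addition grokking setup. The point is only
# that the training fraction is the knob that decides which solution the
# network ends up with.
import itertools
import random

P = 113  # modulus; every pair (a, b) gets the label (a + b) % P

all_pairs = [((a, b), (a + b) % P) for a, b in itertools.product(range(P), repeat=2)]
random.shuffle(all_pairs)

def split(train_fraction):
    n_train = int(train_fraction * len(all_pairs))
    return all_pairs[:n_train], all_pairs[n_train:]

# Illustrative (made-up) regimes:
#   train_fraction ~ 0.9 -> enough data: learns the general solution directly
#   train_fraction ~ 0.1 -> too little data: pure memorization, never generalizes
#   train_fraction ~ 0.3 -> in between: memorizes first, then "groks" and switches
train_data, test_data = split(0.3)
```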
In general I'm not convinced there is a barrier. I think typically the network is always going down in the loss landscape; it just sometimes takes time.
Sorry for not elaborating more.
I think this is an illustration of world model crystallization.
I think world model crystallization is a more central example of crystallization than value/agent crystallization. This perspective follows from and is tied up with a lot of other beliefs I have, which would be easier to explain in person.
Quote from the SSC post Studies on Slack, which seems relevant here:
Inferential distance is the number of steps it takes to make someone understand and accept a certain idea. Sometimes inferential distances can be very far apart. Imagine trying to convince a 12th century monk that there was no historical Exodus from Egypt. You’re in the middle of going over archaeological evidence when he objects that the Bible says there was. You respond that the Bible is false and there’s no God. He says that doesn’t make sense, how would life have originated? You say it evolved from single-celled organisms. He asks how evolution, which seems to be a change in animals’ accidents, could ever affect their essences and change them into an entirely new species. You say that the whole scholastic worldview is wrong, there’s no such thing as accidents and essences, it’s just atoms and empty space. He asks how you ground morality if not in a striving to approximate the ideal embodied by your essence, you say…well, it doesn’t matter what you say, because you were trying to convince him that some very specific people didn’t leave Egypt one time, and now you’ve got to ground morality.
[...]
Another way of thinking about this is that there are two self-consistent equilibria. There’s your equilibrium, (no Exodus, atheism, evolution, atomism, moral nonrealism), and the monk’s equilibrium (yes Exodus, theism, creationism, scholasticism, teleology), and before you can make the monk budge on any of those points, you have to convince him of all of them.
[...]
The monk certainly shouldn’t immediately accept your claim, when he has countless pieces of evidence for the existence of God, from the spectacular faith healings he has witnessed (“look, there’s this thing called psychosomatic illness, and it’s really susceptible to this other thing called the placebo effect…”) to Constantine’s victory at the Mulvian Bridge despite being heavily outnumbered (“look, I’m not a classical scholar, but some people are just really good generals and get lucky, and sometimes it happens the day after they have weird dreams, I think there’s enough good evidence the other way that this is not the sort of thing you should center your worldview around”). But if he’s willing to entertain your claim long enough to hear your arguments one by one, eventually he can reach the same self-consistent equilibrium you’re at and judge for himself.
I think you’re making the distinction more confusing than it has to be.
There are things that have motivational pull, and there are things that don't, but I do them anyway because they are a necessary step to get what I actually want.
Say I want to get an apple, and the easiest way to get one is going to the store and buying some. Going to the store in this story is clearly an instrumental goal, and enjoying eating my apple is a terminal goal.[1]
Things that are instrumental can acquire the property of being terminal by association in our brain, because of how human brains work. This is not true of every agent design. E.g. if I start to want to go to the store for no particular reason, then going to the store has gotten some terminal-goal quality. But it is possible to repeatedly do a thing for instrumental reasons (including for humans) without that thing ever becoming a terminal goal. E.g. I regularly buy food, and I've never been tempted to go to the store unless I'm motivated by some terminal goal of buying some specific thing.

Every agent that is more complicated than a thermostat does have temporary instrumental goals all the time.
It often feels like the Bostrom paradigm implicitly divides the future into two phases. There’s the instrumental phase, during which your decisions are dominated by trying to improve your long-term ability to achieve your goals.
This is what happens if you're a long-term consequentialist (values over outcomes, not processes), with non-updating goals and no value discounting. I agree that this is a special mind shape, and not every agent. But you don't end up there just by having a distinction between terminal and instrumental values.
- ^
Although it can have instrumental qualities too, e.g. I might want to be less hungry.
are genuinely being used as something approaching an assassination market.
Can you elaborate on this please?
I’m not 100% sure, and I can’t prove it, but I don’t think there’s any direct innate drive for self-esteem to be high.
Some possible evidence for this claim: I've heard that people who are alone for a long enough time lose their sense of self. I don't know what that means exactly, but it might mean not having any self-valence (positive or negative), because when you're alone that does not matter, since it was just a proxy for other people's opinions.
Pet theory: Mirroring is only “subconscious” because we have internalized that it’s cringe to mirror.
Mirroring is an important part of dancing and also a way of saying "yes" in non-verbal consent. Both of these are easier to do if you let yourself notice what you're doing.
Update May 2025: I think there’s a more general rule that, if a person wants to do X, then either X has a past and ongoing history of immediately (within a second or so) preceding primary reward, or the person is doing X as a means-to-an-end of getting to Y, where Y is explicitly and consciously represented in their own mind as they do X (for example, think of someone going upstairs to get a sweater). I say this based on how I think reinforcement learning and credit assignment work in the brain. If you buy it, then it would follow that feeling liked / admired has to lead to immediate primary reward, since people are not explicitly thinking about the long-term benefits of social status.
What counts as "primary reward" here? If it means innate-drive-type reward triggered by the brainstem, and not defer-to-predictor, then this claim seems wrong.
Oh, I guess I'm less clear than I thought. My goal is not to suggest self-therapy ideas. I'm naming them as evidence for the multi-agent model of mind, evidence that has to be explained some other way if there are in fact no multiple agents.
The post is a good explanation of the multi-agent model of mind. I've used this frame and agree that it's useful. But that's not the problem I'm trying to address. The problem is that this can't possibly be the ground truth. Neither do I find it likely that this model is a correct explanation at some middle level of reduction.
I've seen people in shard theory, and most recently this, where someone takes the sub-agent perspective to be way too fundamental.

I claim that you have conflicting goals, and that for some people it's sometimes useful to model yourself as being made up of conflicting sub-agents. But you aren't literally made up of different sub-agents. The multi-agent model of mind is probably good enough for self therapy, but it's not good enough for alignment research.
The problem with taking this model too seriously (and not just as a useful tool) is that it has the homunculus problem. You still have to explain how agency works. Turns out minds are complicated! And when you start thinking about how to fit even one functional agent in a brain, you realize that there are not going to be lots of them.
My shortform was about how come, even though there is no parliament, this style of self therapy works anyway.
You probably wrote this comment before I was done writing?
I sometimes model myself as having sub-agents, and find this perspective very helpful for self therapy. On the other hand, I don't believe that I literally have individual independent sub-agents in my brain, because agent architecture is actually complicated.
I like Steve Byrnes's mind model, because it's an actually plausible model for how a real-world agentic mind might work. It's also the only such model I know about.
Steve’s model is not a multi agent model, and I can’t think of a multi agent model that works.
Can I explain my own and others' sub-agent experiences within Steve's model? I think I can. I have a few different explanations for different types of sub-agent experience.
Gendlin's Focusing:
This practice is at least sub-agent adjacent. My experience is that one part of me is trying to query another part of me.
My current model for what's happening: There is something peripheral in my mind's context that the thought assessor reacts to. Typically it's something I have potentially strong emotions about, except that in the moment the thing is very low salience to me, so I end up having low-salience emotions. I might notice something feels off, but I can't tell why, because the thought assessor does not explain its judgments; that's not how it works. So to find out what is bothering me, I can examine various hypotheses by increasing their salience in my mind and seeing how my thought assessor reacts. The way I notice these reactions is via emotional shifts.
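A very loose toy sketch of that loop, treating the thought assessor as a black box that only outputs valence shifts (all the hypotheses and numbers below are made up purely to show the structure):

```python
# Toy illustration of the Focusing loop described above: raise the salience of
# one hypothesis at a time and notice how the (opaque) thought assessor reacts.
def thought_assessor(hypothesis):
    # Stand-in for the opaque valence reaction; we only see the output,
    # never an explanation of the judgment. These values are made up.
    hidden_reactions = {"work deadline": -0.7, "argument yesterday": -0.2, "nothing": 0.0}
    return hidden_reactions.get(hypothesis, 0.0)

def focusing(hypotheses):
    # Hold each hypothesis up in turn, notice the emotional shift,
    # and keep the one that produces the strongest reaction.
    shifts = {h: thought_assessor(h) for h in hypotheses}
    return min(shifts, key=shifts.get)

print(focusing(["work deadline", "argument yesterday", "nothing"]))  # "work deadline"
```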
Internal Double Crux:
For this I will start by quoting part of the Valence sequence, section 1.5.1:

“Valence” (as I’m using the term) is a property of a thought—not a situation, nor activity, nor course-of-action, etc.
For example, suppose I have “mixed feelings” about going to the gym. What does that mean? In all likelihood, it means:
If I think about going to the gym in a certain way—if I pay attention to certain aspects of it, if I think about it using certain analogies / frames-of-reference, etc.—then that thought is appealing (positive valence);
If I think about going to the gym in a different way, then that thought is unappealing (negative valence).
For example, maybe the first thought is something like “I will go to the gym, thus following through on my New Year’s Resolution” and the second thought is something like “I will go to the loud and cold and smelly gym”.
So the action of “going to the gym” does not have a well-defined valence. But each individual thought does have a well-defined valence.
This is a pretty text-book IDC situation.
What is going on here is that you have two clusters of associations with the gym that are colored with different valence. Because valence itself is relevant for associations in our mind (we associate similar things, and "good things" are different from "bad things"), if the two perspectives have different valence, that can be a barrier to integration, causing two different association clusters, one "pro gym" and one "anti gym". If these perspectives get developed enough, it can feel like two different persons inside you with different beliefs.
In the practice of IDC we hold these perspectives up to each other, and let the valence flow both ways across the associations, until it reaches some balance. Now, instead of being in two minds over the gym, you're just feeling kind of ok with going, or something like that.
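A minimal toy sketch of that "valence flowing until it balances" picture (the clusters, starting valences, and mixing rate are all made up; this just shows the shape of the dynamic):

```python
# Two association clusters about "the gym" start with opposite valences.
# Holding them up to each other lets valence leak across the associations
# until the two clusters roughly agree.
pro_gym = +0.8   # "following through on my New Year's Resolution"
anti_gym = -0.6  # "the loud and cold and smelly gym"

mixing_rate = 0.2
for _ in range(30):
    flow = mixing_rate * (pro_gym - anti_gym)
    pro_gym -= flow / 2
    anti_gym += flow / 2

print(pro_gym, anti_gym)  # both end up near +0.1: "kind of ok with going"
```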
Dealing with ego-dystonic emotions:
Let's say I'm angry at my partner. It would be good to understand why I'm angry and sort it out. But I don't like being angry at my partner. They are great and I love them, and the thought that I'm angry at them is very aversive, and therefore hard to think, which makes it hard to resolve whatever is going on. However, the thought "there is a part of me that is angry at my partner" is less aversive. I can tolerate that thought. And now I can actually investigate and address the underlying problem.
I think this is a very large part of the usefulness of the multi-agent model of mind. It lets me avoid both denying my emotion and fully owning up to it. It's a great trick. And it's not even wrong. It is true that I am feeling that anger. It's also true that that anger is not everything I am, or everything I'm feeling.
The bit where I anthropomorphize the anger is not load-bearing, but it's also not harmful.

Self sabotage:
I don't have much experience with this, and it's not much in the rationalist discourse. But it's an application of the multi-agent model of mind that I've seen elsewhere. It goes something like this: There is a secret part of you that for some reason (different models have different explanations here) is trying to make you fail. There is an active intelligence, in your mind, optimizing against you.

What I think is going on is that the thought assessor has (for some reason) ended up labeling "...and then I'll succeed" as aversive (Steve calls this negative valence, but I think aversive is a better term). What this means in practice is that your mind will expel any plan you believe will lead to you actually succeeding, so you'll only think of plans, or feel positively about executing plans, that will not let you succeed.
This can potentially be very hard to notice.
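A toy sketch of that mechanism (the plans, success probabilities, and aversion weight are all made up; it just shows how an aversive label on predicted success suppresses the plans most likely to work):

```python
# If the thought assessor has tagged "...and then I'll succeed" as aversive,
# plans predicted to succeed get suppressed before they are seriously considered.
plans = {
    "actually apply for the job": 0.9,   # predicted probability of success
    "vaguely think about applying": 0.3,
    "reorganize my desk instead": 0.05,
}

success_aversion = -1.0  # the mislabeled valence on predicted success

def plan_valence(p_success):
    return success_aversion * p_success

# The plan that "feels best" is the one least likely to actually succeed.
chosen = max(plans, key=lambda p: plan_valence(plans[p]))
print(chosen)  # -> "reorganize my desk instead"
```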
The book Existential Kink recommends that if you notice that you are self sabotaging, you should just accept that you actually want to fail, and engage with this kink, and enjoy it, until you get it out of your system. Why would this work?
It's very hard to rid your brain of aversions: since you're avoiding the things you're averse to, you don't get the training data that would show your aversions are wrong. You can do exposure therapy for some experiences that are easy to produce, but if you're avoiding success it might be hard to just generate some success to see that it's not too bad. So instead you flip it around and focus on "failure is good", and then you can maybe notice that that's not true…
This explanation isn’t the best, but I think there is something here about getting around the avoidance, and possibly also about getting around an ego-dystonic barrier.
Control-flavoured leadership and related mismangement[2], not mentorship and helping those under you to flourish style, which results in much less effectiveness especially on hard to measure progress
I read How to Hire a Team, and I think I know what you mean, maybe, but I’m not sure.
Are you talking about the post being very focused on hiring (or not hiring) people who will execute the employer's mission, rather than hiring people who will contribute new ideas?
RMN is probably the strongest example of this. CNC is something that some people genuinely want, but it has extremely sharp edges that I am concerned are not being sufficiently well-respected. In particular, the mix of social in-group safety seeking for people trying to join, desperation to save the world, and this being socially rewarded and kind of a networking opportunity I suspect causes many people who, while consenting, are genuinely harmed by the experience. Someone well-placed to assess estimated that 15-20% of the women there are not actually all that into it, which seems honestly horrifying to me.
I agree that this is horrifying.
Have the organizers of RMN been informed of this? If not, it is potentially very important to tell them.
I put high probability on [it has not occurred to the organizers that people might treat their event as a professional networking event]. (Too tired today to come up with a number.)
I'm not sure there is any way at all to run a sex/BDSM party safely, if some participants are treating it as a professional networking event. But the organizers at least stand a better chance if they are considering this possibility.

I don't think we should have a community policy that you can't run a CNC event because it may cause some people to lie about their kinks for professional advantage. Partly because that would incentivize the people who really want these events to hide the problem rather than mitigate it. Partly because how would such a policy even work? As far as I know RMN isn't officially Rat/LW affiliated. And partly because it's important to allow fun things, including fun things that are not for everyone.
The actual solution is probably something like [there should be more other (non-sexual) fun bonding type social events]. Although, because rationalists are a decentralized community, there isn’t anyone who is responsible for making this happen. It’s just up to whoever takes this on.
Also, I'm noticing I'm confused. There are other rationalist social events in the Bay, right? There are (semi-regular?) reading meetups, and Laser-quest-sardines, right? I remember both of these being open events. And there is also LessOnline, which is super advertised here on LW every year. It doesn't seem to me like the community is so closed that anyone needs to attend a sex party because they don't have other networking opportunities. Are the people doing this following real incentives or imagined incentives? No victim blaming, but the solution is different in these two cases. If it's imagined incentives, then this only requires better information about other options, which is much easier than making up whole new events.
The structure Instrumental vs Terminal was pointing to seems better described as Managed vs Unmanaged Goal-Models. A cognitive process will often want to do things which it doesn’t have the affordances to directly execute on given the circuits/parts/mental objects/etc it has available. When this happens, it might spin up another shard of cognition/search process/subagent, but that shard having fully free-ranging agency is generally counterproductive for the parent process.
This doesn’t seem neurologically plausible to me. Happy to argue it out in person.
You are wildly overestimating how much computation in superposition is possible.
The bottleneck for comp-in-sup is not fitting the info into the neurons, but fitting the circuits into the weights.
I think your circuit depth calculations assume that each circuit gets to use the entire weight matrix at each step. That's not how this works.
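To make the weights-vs-neurons point concrete, here is a back-of-the-envelope sketch; the model size and the params-per-circuit figure are made-up assumptions for illustration, not measurements:

```python
# Back-of-the-envelope for why I think the weights, not the neurons, are the
# bottleneck for computation in superposition.
d_model = 4096
n_layers = 32
params_per_layer = 12 * d_model ** 2        # rough transformer-block estimate
total_params = n_layers * params_per_layer  # ~6.4e9
total_neurons = n_layers * 4 * d_model      # MLP neurons, very roughly ~5e5

params_per_circuit = 10_000  # assumption: weights one small circuit needs

circuits_weights_allow = total_params // params_per_circuit
print(f"total params:  {total_params:.2e}")
print(f"total neurons: {total_neurons:.2e}")
print(f"circuits the weight budget allows (ignoring sharing and interference): "
      f"{circuits_weights_allow:.2e}")
# Counting how many features fit in the activations (neuron superposition)
# gives a much larger number, which is why activation capacity alone
# overestimates how much parallel computation actually fits.
```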
Despite this I overall agree with your picture (but disagree with the implications). I expect many shallow circuits. However, you seem to expect that most circuits are mostly end-to-end; maybe the network picks one, or computes all of them and chooses which output to use at the end? I expect the circuits to be much shorter than that, and much more composable. I.e. there are many circuits in series, and each output determines which one to trigger next.
I think this structure can be very agentic, when large enough.
I’ve messaged about it in the intercom.
Because I want equations
The LW draft editor keeps breaking for me in various ways. I need somewhere else that lets me collaborate on posts, and preferably also easily lets me copy posts over to LW.
Claude recommended using HackMD, so I'll try that. But I'm also interested in human recommendations for what editor to use.
I just read all of post 4
https://www.lesswrong.com/s/hpWHhjvjn67LJ4xXX/p/JRcNNGJQ3xNfsxPj4
There is nothing there about a toy model mapping pairs of integers.
What am I missing?