“Nono, you have been misled. I *do* have a hero license.”
I’ve been exploring evolutionary metaphors for ML, so here’s a toy metaphor for RLHF: recessive persistence. (Still just trying to learn both fields, however.)
“Since loss-of-function mutations tend to be recessive (given that dominant mutations of this type generally prevent the organism from reproducing and thereby passing the gene on to the next generation), the result of any cross between the two populations will be fitter than the parent.” (k)
Related:
Recessive alleles persist due to overdominance letting detrimental alleles hitchhike on a fitness-enhancing dominant counterpart. The detrimental effects on fitness only show up when two recessive alleles inhabit the same locus, which can be rare enough that the dominant allele still causes the pair to be selected for in a stable equilibrium.
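To make that concrete, here is a minimal sketch of the standard one-locus overdominance model (all fitness values are made up for illustration): because the heterozygote is fittest, the detrimental recessive allele settles at a stable interior frequency instead of being purged.

```python
# Toy one-locus overdominance (heterozygote advantage) model.
# Genotype fitnesses are hypothetical: Aa is fittest, so the detrimental
# recessive allele a "hitchhikes" and persists at a stable equilibrium.
w_AA, w_Aa, w_aa = 0.9, 1.0, 0.6   # costs: s = 0.1 for AA, t = 0.4 for aa

p = 0.99  # starting frequency of the common allele A
for gen in range(200):
    q = 1 - p
    w_bar = p*p*w_AA + 2*p*q*w_Aa + q*q*w_aa   # mean fitness
    p = (p*p*w_AA + p*q*w_Aa) / w_bar          # standard selection recursion

s, t = 1 - w_AA, 1 - w_aa
print(f"simulated equilibrium p = {p:.3f}; analytic t/(s+t) = {t/(s+t):.3f}")
```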
The metaphor with deception breaks down due to the unit of selection. Parts of DNA are stuck much closer together than neurons in the brain or parameters in a neural network. They’re passed down or reinforced in bulk, which is what makes hitchhiking so common in genetic evolution.
(I imagine you can have chunks that are updated together for a while in ML as well, but I expect that to be transient and uncommon. Idk.)
Bonus point: recessive phase shift.
“Allele-frequency change under directional selection favoring (black) a dominant advantageous allele and (red) a recessive advantageous allele.” (source)
In ML:
A generalisable, non-memorising pattern starts out small/sparse/simple.
Which means that input patterns rarely activate it, because it’s a small target to hit.
But on most of the occasions it *is* activated, it gets reinforced (at least more reliably than memorised patterns are).
So it gradually causes upstream neurons to point to it with greater weight, taking up more of the input range over time. Kinda like a distributed bottleneck.
Some magic exponential thing, and then phase shift!
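For the genetics half of the analogy, here is a minimal sketch (illustrative numbers only) of the allele-frequency dynamics the figure caption above describes: a recessive advantageous allele crawls along for many generations and then sweeps rapidly, while a dominant one rises early and plateaus.

```python
# Toy directional-selection sweep for an advantageous allele with selective
# advantage s, either dominant or recessive. The recessive variant barely
# moves while rare (it is hidden in heterozygotes), then shoots up -- the
# phase-shift shape referenced above. All numbers are illustrative.
s = 0.05

def step(p, dominant):
    q = 1 - p
    if dominant:   # genotype fitnesses: AA = 1+s, Aa = 1+s, aa = 1
        w_AA, w_Aa, w_aa = 1 + s, 1 + s, 1.0
    else:          # recessive advantage: AA = 1+s, Aa = 1, aa = 1
        w_AA, w_Aa, w_aa = 1 + s, 1.0, 1.0
    w_bar = p*p*w_AA + 2*p*q*w_Aa + q*q*w_aa
    return (p*p*w_AA + p*q*w_Aa) / w_bar

for dominant in (True, False):
    p, traj = 0.01, []
    for gen in range(2000):
        p = step(p, dominant)
        traj.append(p)
    label = "dominant " if dominant else "recessive"
    print(label, [round(traj[g], 3) for g in (99, 499, 999, 1999)])
```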
One way the metaphor partially breaks down is that DNA doesn’t have weight decay at all, which allows recessive beneficial mutations to very slowly approach fixation.
Eigen’s paradox is one of the most intractable puzzles in the study of the origins of life. It is thought that the error threshold concept described above limits the size of self replicating molecules to perhaps a few hundred digits, yet almost all life on earth requires much longer molecules to encode their genetic information. This problem is handled in living cells by enzymes that repair mutations, allowing the encoding molecules to reach sizes on the order of millions of base pairs. These large molecules must, of course, encode the very enzymes that repair them, and herein lies Eigen’s paradox...
(I’m not making any point, just wanted to point to interesting related thing.)
Seems like Andy Matuschak feels the same way about spaced repetition being a great tool for innovation.
I like the framing. Seems generally usefwl somehow. If you see someone believing something you think is inconsistent, think about how to money-pump them. If you can’t, then are you sure they’re being inconsistent? Of course, there are lots of inconsistent beliefs that you can’t money-pump, but seems usefwl to have a habit of checking. Thanks!
How do you account for the fact that the impact of a particular contribution to object-level alignment research can compound over time?
Let’s say I have a technical alignment idea now that is both hard to learn and very usefwl, such that every recipient of it does alignment research a little more efficiently. But it takes time before that idea disseminates across the community.
At first, only a few people bother to learn it sufficiently to understand that it’s valuable. But every person that does so adds to the total strength of the signal that tells the rest of the community that they should prioritise learning this.
Not sure if this is the right framework, but let’s say that researchers will only bother learning it if the strength of the signal hits their person-specific threshold for prioritising it.
Researchers are normally distributed (or something) over threshold height, and the strength of the signal starts out below the peak of the distribution.
Then (under some assumptions about the strength of individual signals and the distribution of threshold heights), every learner who adds to the signal will, at first, attract more than one further learner, until the signal passes the peak of the distribution and the idea reaches saturation/fixation in the community.
If something like the above model is correct, then the impact of alignment research plausibly goes down over time.
But the same is true of a lot of time-buying work (like outreach). I don’t know how to balance this, but I am now a little more skeptical of the relative value of buying time.
Importantly, this is not the same as “outreach”. Strong technical alignment ideas are most likely incompatible with almost everyone outside the community, so the idea doesn’t increase the number of people working on alignment.
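A minimal sketch of the threshold model above (a Granovetter-style cascade; every number here is made up for illustration):

```python
# Toy threshold-cascade model of idea adoption. Each researcher learns the
# idea once the community-wide signal exceeds their personal threshold, and
# every learner adds one unit to the signal. With roughly normal thresholds,
# adoption starts slow, accelerates past the peak, then saturates.
import random
random.seed(0)

N = 500
thresholds = sorted(random.gauss(50, 25) for _ in range(N))

signal = 5.0          # initial signal from the originator / early adopters
adopters = 0
history = []
for step in range(12):
    newly_convinced = sum(1 for t in thresholds[adopters:] if t <= signal)
    adopters += newly_convinced
    signal += newly_convinced   # each new learner strengthens the signal
    history.append(adopters)

print(history)   # slow start, acceleration, then saturation near N
```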
That’s fair, but sorry[1] I misstated my intended question. I meant that I was under the impression that you didn’t understand the argument, not that you didn’t understand the action they advocated for.
I understand that your post and this post argue for actions that are similar in effect. And your post is definitely relevant to the question I asked in my first comment, so I appreciate you linking it.
[1] Actually sorry. Asking someone a question that you don’t expect yourself or the person to benefit from is not nice, even if it was just due to careless phrasing. I just wasted your time.
No, this isn’t the same. If you wish, you could try to restate what I think the main point of this post is, and I could say if I think that’s accurate. At the moment, it seems to me like you’re misunderstanding what this post is saying.
I would not have made this update by reading your post, and I think you are saying very different things. The thing I updated on from this post wasn’t “let’s try to persuade AI people to do safety instead,” it was the following:
If I am capable of doing an average amount *w* of alignment work per unit time, and I have *t* units of time available before the development of transformative AI, I will have contributed *w·t* work. But if I expect to delay transformative AI by *d* units of time if I focus on that instead, everyone will have that additional time to do alignment work, which means my impact is *d·w·n*, where *n* is the number of people doing alignment work. Naively then, if *d > t/n*, I should be focusing on buying time.[1]
[1] This assumes time-buying and direct alignment work are independent, whereas I expect doing either will help with the other to some extent.
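A worked toy version of this comparison (numbers purely illustrative):

```python
# Compare direct alignment work against buying time, using the variables
# from the paragraph above. All numbers are hypothetical.
w = 1.0     # my average alignment work per unit time
t = 10      # units of time left before transformative AI
n = 300     # number of people doing alignment work
d = 0.1     # delay I expect to add if I focus on buying time

direct_contribution = w * t        # work I do myself
buying_time_impact  = d * w * n    # extra work the whole field gets from the delay

print(direct_contribution, buying_time_impact, "breakeven delay:", t / n)
```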
A concrete suggestion for a buying-time intervention is to develop plans and coordination mechanisms (e.g. assurance contracts) for major AI actors/labs to agree to pay a fixed percentage alignment tax (in terms of compute) conditional on other actors also paying that percentage. I think it’s highly unlikely that this is new to you, but didn’t want to bystander just in case.
A second point is that there is a limited number of supercomputers that are anywhere close to the capacity of top supercomputers. The #10 most powerfwl is 0.005% as powerfwl as the #1. So it could be worth looking into facilitating coordination between them.
Perhaps one major advantage of focusing on supercomputer coordination is that the people who can make the relevant decisions[1] may not actually have any financial incentives to participate in the race for new AI systems. They have financial incentives to let companies use their hardware to train AIs, naturally, but they could be financially indifferent to how those AIs are trained.
In fact, if they can manage to coordinate it via something like an assurance contract, they may have a collective incentive to demand that AIs are trained in safer, alignment-tax-paying ways, because then companies have to buy more computing time for the same level of AI performance. That may be too much to hope for, though. The main point is just that their incentives may not have a race dynamic.
Who knows.
[1] Maybe the relevant chain of command goes up to high government in some cases, or maybe there are key individuals or small groups who have relevant power to decide.
(Update: I’m less optimistic about this than I was when I wrote this comment, but I still think it seems promising.)
Multiplier effects: Delaying timelines by 1 year gives the entire alignment community an extra year to solve the problem.
This is the most and fastest I’ve updated on a single sentence as far back as I can remember. I am deeply gratefwl for learning this, and it’s definitely worth Taking Seriously. Hoping to look into it in January unless stuff gets in the way.
Have other people written about this anywhere?
I have one objection to claim 3a, however: Buying-time interventions are plausibly more heavy-tailed than alignment research in some cases because 1) the bottleneck for buying time is social influence and 2) social influence follows a power law due to preferential attachment. Luckily, the traits that make for top alignment researchers have limited (but not insignificant) overlap with the traits that make for top social influencers. So I think top alignment researchers should still not switch in most cases on the margin.
When walls don’t work, can use ofbucsation? I have no clue about this, but wouldn’t it be much easier to use pbqrjbeqf for central wurds necessary for sensicle discussion so that it wouldn’t be sreachalbe, and then have your talkings with people on fb or something?
Would be easily found if written on same devices or accounts used for LW, but that sounds easier to work around than literally only using paper?
Yes! The way I’d like it is if LW had a “research group” feature that anyone could start, and you could post privately to your research group.
Same! LW is an outstanding counterexample to my belief that resurrections are impossible. But I haven’t incorporated it into my gears-level model yet, and I’m unsure how to. What did LW do differently, or which gear in my head caused me to fail to predict this?
Here’s my definitely-wrong-and-overly-precise model of productivity. I’d be happy if someone pointed out where it’s wrong.
It has three central premises: a) I have proximal (basal; hardcoded) and distal (PFC; flexible) rewards. b) Additionally, or perhaps for the same reasons, my brain uses temporal-difference learning, but I’m unclear on the details. c) Hebbian learning: neurons that fire together, wire together.
If I eat blueberry muffins, I feel good. That’s a proximal reward. So every time my brain produces a motivation to eat blueberry muffins, and I take steps that make me *predict* that I am closer to eating blueberry muffins, the synapses that produced *that particular motivation* get reinforced and are more likely to fire again next time.
The brain gets trained to produce the motivations that more reliably produce actions that lead to rewards.
If I get out of bed quickly after the alarm sounds, there are no hardcoded rewards for that. But after I get out of bed, I predict that I am better able to achieve my goals, and that prediction itself is the reward that reinforces the behaviour. It’s a distal reward. Every time the brain produces motivations that in fact get me to take actions that I in fact predict will make me more likely to achieve my goals, those motivations get reinforced.
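That get-out-of-bed pattern is, as far as I understand it, roughly what the TD(0) update in premise (b) does: the change in predicted value acts as the reinforcement signal even when there is no hardcoded reward. A minimal sketch (states and numbers are hypothetical):

```python
# Minimal TD(0) value-update sketch: a jump in predicted value backs up to
# the preceding state/action even though the immediate reward is zero.
alpha, gamma = 0.1, 0.9                              # learning rate, discount
V = {"in_bed": 0.0, "up": 0.0, "working": 1.0}       # "working" is the rewarding state

for episode in range(100):
    # Getting up yields no hardcoded reward (r = 0); the increase in
    # predicted value of the next state is what reinforces the step.
    for s, s_next, r in [("in_bed", "up", 0.0), ("up", "working", 0.0)]:
        td_error = r + gamma * V[s_next] - V[s]      # prediction change as reward
        V[s] += alpha * td_error

print(V)   # "up" and "in_bed" acquire value purely via bootstrapped predictions
```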
But I have some marginal control over *which motivations I choose to turn into action*, and some marginal control over *which predictions I make* about whether those actions take me closer to my goals. Those are the two levers with which I am able to gradually take control over which motivations my brain produces, as long as I’m strategic about it. I’m a fledgling mesa-optimiser inside my own brain, and I start out with the odds against me.
I can also set myself up for failure. If I commit to, say, studying math for 12 hours a day, then… at first I’m able to feel like I’ve committed to it, as long as I naively expect, right then and there, that the commitment takes me closer to my goals. But come the next day when I actually try to do it, I run out of steam, and it becomes harder and harder to resist the motivations to quit. And when I quit, *the motivations that led me to quit get reinforced because I feel relieved* (proximal reward). Trying-and-failing can build up quitting-muscles.
If you’re a sufficiently clever mesa-optimiser, you *can* make yourself study math for 12 hours a day or whatever, but you have to gradually build up to it. Never make a large ask of yourself before you’ve sufficiently starved the quitting-pathways to extinction. Seek to build up simple well-defined trigger-action rules that you know you can keep to every single time they’re triggered. If more and more of input-space gets gradually siphoned into those rules, you starve alternative pathways out of existence.
Thus, we have one aspect of the maxim: “You never make decisions, you only ever decide between strategies.”
Thank you for this discussion btw, this is helpfwl. I suspect it’s hitting diminishing returns unless we home in on practical specifics.
I think our levels of faith in the rationality community are a crux. Here’s what I think, though I would stress again that although I tentatively believe what I say, I am not trying to be safe to defer to. Thus, I omit disclaimers and caveats and instead try to provide perspectives for evaluation. I think this warning is especially prudent here.
We have a really strong jargon hit-rate
The “natural incentives around jargon creation” in most communities favour usefwlness much less than they do in this community. I can think of some examples of historically bad jargon:
“Politics is the mind killer” (not irredeemably bad, but net bad nonetheless imo)
“Bayesian”
Not confident here, but I think the term expanded too far from its roots and got overemphasised. This could be prevented either by an increased willingness to use new terms for neighbouring semantic space, or an increased unwillingness to expand the use of shibboleths to cover new things.
“NPC” (non-player character)
Not irredeemable, but questionable net value.
Probably more here but I can’t recall.
I think our hit-rate so far on jargon has been remarkably good. Even under the assumption that increased coinage reduces accuracy (which I weakly disagree with), it seems plausible that coining more jargon on the margin will take us closer to the Pareto frontier.
I am less worried about becoming marginally more insular
Our collective project is exceedingly dangerous. We’re deactivating our memetic immune system and fumbling towards deliberate epistemic practices that we hope can make up for it. I think rationality education must consist of lowering intuitive defenses in tandem with growing epistemological awareness. And in cases where this education is out of sync, it produces victims.
But I’d be wary of updating too much on Leverage as an indictment of rationality culture in general. That kind of defensiveness is the same mechanism by which hospitals get bureaucrified—they’re minimising false-positives at the cost of everything else.
I suspect that our community’s cultural inclination against these failure modes makes it more likely that our epistemic norms weaken with widespread social integration with other cultures.
I also think, more generally, that norms/advice that were necessary early on, could nowadays actively be hampering our progress. “Be less sure of yourself, seek wisdom from outside sources, etc.” is necessary advice for someone just starting out on the path, but at some point your wisdom so far exceeds outside sources that the advice hits diminishing returns—tune yourself to where you sniff out value of information, whether that be insular or not.
Epistemic status: Do not defer to me. I’m here to provide interesting arguments and patterns that may help enlighten our understanding of things. I’m not optimising my conclusions for being safe to defer to. (It’s the difference between minimising false-positives vs minimising false-negatives.)
A high bar for adopting new jargon does not conflict with a low bar for suggesting jargon. In fact, if more jargon is suggested, I expect a lower proportion of it to be adopted, and also that the average quality of the winning jargon goes up.
There’s a speed/scope tradeoff here similar to the Zollman effect. If what you care about is that your subculture advances in idea space asap, then adopting words faster could be good. If instead it matters that the culture be accessible to a greater number of people, then a strong prior against jargon seems better. I care more about the progress of the subculture than I do about its breadth, at least on the current margin as I see it.
Point 2 above assumes for the sake of argument that you are correct about accessibility dropping the more jargon there is. But I don’t think the case for that is very strong. I think well-placed jargon makes ideas and whole paradigms a lot easier to learn, and therefore more accessible. Furthermore, I think it’s worth making a distinction between community-accessibility and idea-accessibility. A lot of jargon makes it harder to participate in the community without a greater understanding of the ideas, but it also makes those ideas easier to understand in the first place. The net effect is probably that it creates a clearer separation between people inside and outside the community, with fewer gradients in between.
Re your concern about epistemic closure (thanks for the jargon btw!)[1]: I think if the community had more widespread healthy epistemic norms, we would be more effective at adopting new ideas and at reasoning rationally about outside perspectives, and more willing to change our ways en masse if an outside perspective is actually better. The rationality community is qualitatively different from other communities because prestige is to a large extent determined by one’s ability to avoid epistemic failure modes, such that the usual “cult warning signs” apply to a lesser degree.
Saying heretical stuff here, I know, but I did disclaim deferral status in the first sentence, so should be safe : )
Final point: bureaucratic acronyms are just way terribler than Internet-slang acronyms, tho! :p
[1] This is not meant to be a critique; I just found the irony a little funny. I appreciate your comment, and learning about epistemic closure from the link.
Yeah, all these “alarms” are supposed to warn you that the word (or something) might be misleading, and you should pay extra attention (unless it’s already obvious) to avoid being misled. Or, pay extra attention because there is something you can do in response which is profitable.
nou. im busy rn, maybe later.
O.O!
They should bring it back.
Still the only anime with what at least half-passes for a good ending. Food for thought, thanks! 👍