21yo. I quit my undergrad math degree to work on technical alignment, then found it’s fundamentally extremely hard, and am now working on cognitive enhancement of FAI researchers.
PM me if you’d like to discuss anything! :)
but very critically it’s fine with modest reductions to risk with high probability over lower chances of completely eliminating risk
Where do you split the “risks” vs “probabilities of risks”?
These are the same object, and you are separating them; the lines you draw around “risks” as the primitive you’re trying to get to generalize, are not an actual thingy which will predictably generalize. Which is most of what I think we’re still disagreeing on.
A probability of risk is also a risk, and so is a probability of probabilities of [...] of risk.
I think the core crux here is that you expect whatever algorithm you implement to create a satisficer, while I’m saying you’re gonna get a maximizer in a trenchcoat. I think this is very important, much more so than the rest of my comments.
If you train an optimizer to avoid risk, it will concentrate its optimization pressure on avoiding risk.
Total consequentialist optimization pressure doesn’t change just because you shift the parameterization of (a representation of) the loss function.
=> This thing is still a maximizer.
What I’m hearing from you is “this risk aversion (to within some
If takeoff is fast, then there’s very little time for this to be relevant [...]
Sum-threshold attacks aren’t about being slow, they’re things which aren’t noticed because they route through many independent channels. I gave bioaccumulants as an example, but in practice it would be more like aerosolized PFA analogues messing with vascular epithelium, pandemics we don’t notice because the symptoms are mild but which impair any range of subtle biological functions, sites like Tiktok inexplicably using more powerful attention algorithms, and many other things which individually go unnoticed.
The reason we use money (or its superior analogue of a currency, compute later on) is because it’s the only resource that lets the AI spend it on terminal goals, no matter what the goal is.
If you’re building a loss function in the real world, it’s tacked to your ontology, and so whatever way you’re trying to get risk-aversion to generalize will also be engineered from your ontology, whereas the AI sees a very different slice of the world and will therefore generalize unexpectedly. If its values mostly generalize to things distant from humans, that’s possibly ok or at least not predictably-to-me worse than nothing; if it sees closer to you, it eg learns to really not want people thinking it messed up, or interacting with a computer <untranslatable> executively inhibiting <firework stylometry> or whatever.
Also, if the AI cares on time horizons beyond the singularity, it either:
Needs to trust cooperation deep into the lightcone if it wants not-Badness to continue. I think most(?) LWers would cooperate, but am a lot less sure about AI company leadership once they’ve acquired a singularity.
Controls the singularity itself; I don’t think I can predict a superintelligence enough to do this sort of trade.
I imagine you addressed these somewhere but if so, I missed that section.
I should’ve been more precise but was a bit occupied when I wrote that comment. Apologies.
Cubefox accurately said what I meant though:
The worry here is that a misaligned risk averse AI might think the existence of humans is an unpredictable risk since they could actively interfere with its long-term goals.
I expect AI to be nationalized before we get mildly superhuman AGI, and that governments are much harder to cooperate with than employees at companies.
The main problem I see with this approach is that risk-averse AIs are just risk-neutral ones who really don’t want something bad to happen, and optimizing for not-badness causes all of the normal misalignment problems anyway. Especially if it cares about not-badness in the rest of the lightcone.
Hm, I think there’s an implicit assumption that the AI will value things that its company can provide. Kinda the whole issue this approach is trying to help with is that we can’t hardcode AI values, which is related to our inability (as these models scale) to tell what they value at all.
I’m not confident we’ll know what 2025 models “value” even with much better empirical tooling, particularly because human ontologies ground out in a mix of sensory, spatial, and temporal primitives, whereas LLM ontologies are...? You can say the word “token” but I don’t think that captures the weirdness.
It’s quite hard to predict in advance that the risk-aversion you’ve trained doesn’t result in a model which really doesn’t like patterns which look like the mitochondrial electron transport chain or something similarly incompatible with technologically limited humans. More likely than physical grounding, I expect the model’s preferences to relate to how/which ideas are transmitted.
Whatever the value may be, I don’t see AI companies having anywhere near the capacity to prevent whatever the AI finds Bad; governments are more likely to be capable of this. Also, I find it unlikely that AI won’t be nationalized before we get mildly superhuman AGI.
[Sum-threshold attacks seem unlikely.] In large part, this is because I tend towards being more skeptical of AI persuasion than most people in the LW community
Heard. I don’t see any easy ways to train a superpersuader, nor would I want to list any in public for hopefully-obvious reasons. But there are non-superpersuasion sum-thresholds, eg engineering an airborne bioaccumulant which messes with brain function (humans have already done the airborne version to themselves in at least 3 ways, and those were accidents).
(2) and (4) are good points.
In practice, things are better than that, since we can drive the probability of human cooperation to multiple nines, or 99.9% as a minimum, because the costs are negligible from our perspective, while the benefits are large
I don’t see this happening with existing geopolitics, even with much more sane governments. I’d be around 60% confident that no human / organization would succeed at a power grab, so max 90% that any cooperation occurs. Also, there are non-catastrophically-risky (to the AI) disempowerment strategies, which I think we should be modeling (eg the AI gradually steers cultural values towards what it wants, and we never notice).
I.e. the AI will be uncertain who will cooperate with it, and will try weirder strategies than nuclear/nanotech such that we’re less likely to notice. Manipulating what humanity cares about via memetics is one example.
Those sections assume that probability of human cooperation is higher than probability of successful takeover, which doesn’t hold for sufficiently powerful AIs.
This might help for AIs barely capable of takeover, but for stronger AIs, the best risk-reduction strategy is to decisively take over to minimize the chance that humans mess with their utility.
Is there a particularly good reason not to hand out thousands of stickers with a compressed thesis like “Palantir paid for those anti-Bores ads”? That seems like it would reach more people...?
But people love a winner, even people who are normally very statistically literate. And people love trying to tell just-so stories and look for reasons why someone lost. I think partly this has to do with the fact that people have such poor gears of politics and are paying so little attention that the only way they can really try to make sense of whom to listen to is by reading the tea leaves of these individual elections’ binary outcomes.
[...]
Super PACs like LTF (the A16Z/Greg Brockman one) and Fairshake (the crypto one) take advantage of this: they know if they can oppose several underdogs who then lose their elections, then electoral folk wisdom will absorb the dogma that to oppose the super PAC is to throw away your race.
(The “Alex Bores” of crypto regulation was Katie Porter: Fairshake spent $10M, she lost, as did a myriad of other underdog candidates they opposed, and now Congress is cartoonishly pro-crypto and has passed very crypto-friendly regulations accordingly.)
Mostly signaling what’s politically acceptable? See also threshold models of social behavior.
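For anyone unfamiliar with threshold models, here is a minimal sketch of the Granovetter-style version, assuming each person joins once enough others already visibly have. The function name `cascade_size` and the example thresholds are illustrative assumptions, not taken from any particular source.

```python
def cascade_size(thresholds):
    """Return how many agents end up participating.

    thresholds[i] is the number of visible participants agent i needs
    to see before joining (hypothetical, Granovetter-style model).
    """
    active = 0
    while True:
        # Everyone whose threshold is met by the current count participates.
        new_active = sum(1 for t in thresholds if t <= active)
        if new_active == active:
            return active  # fixed point: nobody else is tipped over
        active = new_active


# One instigator (threshold 0), then thresholds 1, 2, ..., 99.
full_chain = list(range(100))

# Same crowd, except the threshold-1 person now needs to see 2 participants.
broken_chain = [0, 2] + list(range(2, 100))

print(cascade_size(full_chain))
print(cascade_size(broken_chain))
```

The point of the toy example is that nearly identical populations can end up at very different equilibria, so a small visible signal of what is acceptable can matter a lot more than its direct reach suggests.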
Impact duration up to 8ms can’t be important. The brain is sloshing in fluid for this reason. It’s not stopping in anything like 8ms anyway.
The subarachnoid space, where cerebrospinal fluid sloshes, is 3-6mm in adults. Along with dura, that gives ~1cm.
9mph
I think romeo was talking about a felt emotional response to dominance, whereas you’re talking about optimization pressures. Eg “my agency is being trampled <forms desperation pretzel>” vs “hm today I’m going to p-hack”.
Edit: Did I miss the opening italicized summary, or did you add it in response to this? Either way, that’s the overview I wanted.
I added in response.
Same skill, applied on multiple levels. The skill in “becoming rational”/“coordinating groups of humans”/“aligning AI” is all skill in alignment.
I see what these words might say, but don’t follow the link. Like, seems basically true that rationalism → human coordination works, but AI alignment is such a different thing, so alien to whatever concepts help humans self-align and coordinate.
Perhaps I just need more time to work through this concept. Right now I’m more focused on understanding my own mind, to make better decisions, because I’m finding a crapton of low-hanging fruit very quickly.
I’ve rolled this around in my head for a bit, and it seems to me like, for rationality, “control” of lower processes is better done by something like “improved training data” than operant force.
This is a vignette from my life, not a quote from anyone:
I’ve noticed since I was about 16 that I get sad when seeing attractive women. This always sucked! I tried introspecting, but I was looking at the feeling of sadness (and often trying misguidedly to control that feeling).
Two days ago, I started looking at what the other parts of my mind were doing when I see women; what am I pulling towards, where’s the tension? Oh, partnered romance feels unreachable? Where’s that coming from?
It mostly seems to me that this (unreachable romance) is the wrong conceptualization. Like, my non-deliberate processes are using the wrong concepts. My intuitive ontology is factually wrong, in the sense I care about; and this wrongness tied itself into a self-perpetuating loop (what I’m calling a cognitive attractor).
And so most of the art of rationality, in fact possibly exactly all of it, is to intuitively correct these errors:
I can in fact date, I’ve been accidentally choosing not to
It doesn’t actually hurt me to admit I’m wrong, in most cases
Eyedrops aren’t scary :D
because imposing top-down “control” misaligns:
The optimizer which calls itself an Elliot, wants human flourishing, wants to hold someone
vs the constellation of smaller optimizers Elliot is an intelligible supervenience of
Anyway, your post on BCI facilitated AI alignment looks to me like a step in the same direction. A step towards noticing that AI alignment is downstream of human alignment (in this case, because aligned and augmented humans are more competent which is instrumentally useful), and that the solutions which actually work have more competent humans more tightly integrated in the alignment process for longer
Sounds plausible
rather than keeping a stance of “I’m outside the system, aligning THAT THING is what I’m trying to do, dammit”.
I very much have come to think of myself as a system, nothing special other than this weird consciousness phlogiston, though I haven’t stopped using self/other borders.
Does this fit?
Maybe? I find myself confused about your explanation of AI alignment; but after reading Valentine’s post about memeplexes, I’m thinking you’re talking about being in entirely the wrong frame, where “align the AI” straightforwardly might not be a thing. And “stop doom” might also not be a thing.
(Thanks for linking that, by the way. I preliminarily do expect, if BCIs or similar enhancements work, that moderately superintelligent humans will birth more powerful Friendly hypercreatures. If you have a list of older, similarly gearsy posts, I’d definitely like to read them.)
(I am not much in AF discussion, so I may be entirely misunderstanding what you mean by “empirical grounding”)
By “empirical grounding”, are we talking about parameterization? (For example, the space of all polygons can be parameterized, but interesting concepts like “symmetry” require restricting your parameters).
I assume not, since you said (as the ostensibly non-empirical successful case that we’d get):
AF is a well-defined task like “solving computability” which, if ever successfully solved[2], ends up as a self-contained network of concepts and proofs.
which really should include parameterized descriptions. Can you give a bit more detail on what you mean here?
I see (agree), was misreading the decoder architecture. Will amend this post when I get back to my laptop.
The original study had two different architectures; one decoded phonemes and matched to the nearest of 50 words, while the other was not phonemic, matching only ~10 phrases. I completely missed this architectural gap for the first 20 minutes after Ninety-Three’s second response.
Phonemic decoding seems to scale extremely well; going from 50 words to 125,000 words only raises the word error rate about 2.6× (9.1% → 23.8%). I expect anticipatory / semantic decoding to scale worse, but not extremely poorly.
I agree that 68% and 45% accuracy are terrible, especially on a 50 word vocabulary. The 68% figure was to contextualize the 45% anticipatory accuracy; to show that anticipation doesn’t cause a dramatic accuracy hit, per your original comment.
Then we see that improved methods at 256-electrode resolution (second study) bring accuracy up to ~76% at a 125,000 word vocabulary.
So what I’m extrapolating from this is that, given 2023 SOTA, anticipatory accuracy on ~125,000 word decoding should be ~60-75%. On priors, I don’t see why even a mere hundred times the resolution would get less than 95% accuracy.
Also, the study had a small *test* set, but that’s not the same as *training* on only 10 phrases. Very different statements about underlying capacity.
Yes, with similar accuracy (45% vs 68% in this low-res study) to instantaneous phoneme decoding:
Meanwhile, the area 55b arrays, and the dorsal 55b array in particular, appeared to encode the longer units of language, short sentences and sentences (i.e., those with contextual information), much better than phonemes and words, especially during the reading phase (Figure 5B).
The translation accuracy and precision in that study are quite unimpressive; as I mentioned though, resolution makes an enormous difference:
Enabled by these high-resolution recordings, our study participant—who can no longer speak intelligibly owing to amyotrophic lateral sclerosis—achieved a 9.1% word error rate on a 50-word vocabulary (2.7 times fewer errors than the previous state-of-the-art speech BCI2) and a 23.8% word error rate on a 125,000-word vocabulary (the first successful demonstration, to our knowledge, of large-vocabulary decoding). Our participant’s attempted speech was decoded at 62 words per minute, which is 3.4 times as fast as the previous record
This is with 256 total electrode channels. The tech I’m proposing has about a million times this resolution.
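As a quick sanity check on the numbers in that quote, here is a small arithmetic sketch. The variable names are mine, and treating accuracy as `1 - word error rate` is a simplifying assumption (word error rate also counts insertions, so it is not strictly the complement of accuracy).

```python
# Figures quoted from the 256-channel study above.
wer_small_vocab = 0.091    # word error rate on the 50-word vocabulary
wer_large_vocab = 0.238    # word error rate on the 125,000-word vocabulary
vocab_small = 50
vocab_large = 125_000

# How much bigger the vocabulary got vs. how much worse the error rate got.
vocab_growth = vocab_large / vocab_small
error_growth = wer_large_vocab / wer_small_vocab

# Assumption: approximate accuracy as the complement of word error rate.
approx_accuracy_large = 1 - wer_large_vocab

print(f"vocabulary grew {vocab_growth:.0f}x")
print(f"word error rate grew {error_growth:.2f}x")
print(f"approximate accuracy at large vocabulary: {approx_accuracy_large:.1%}")
```

So a 2,500-fold larger vocabulary costs a bit over 2.6 times the error rate, which is where the ~76% figure at 125,000 words comes from.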
On my model, the psychologically accessible predictions you’re talking about are a result of engrams having worn those computational grooves. Like, your brain fired a ton of anticipatory patterns which were too quiet to feel, but they nonetheless formed the habit you’re reactivating when playing the music.
So we’re ~synced re: psychology underlying irrationality. But I don’t know whether you’re trying to change cultural or individual rationality here:
The problem is that this is in direct opposition to the attempt to control. Ask a thermostat why the room is too cold, and the only answer it has is “Because I haven’t added enough FIRE!”. Why is the rationalist confidently wrong? Because he’s a Bad rationalist! Why is he a bad rationalist? Because y’all haven’t called him out for his Badness! More shame! Beat him into shape! Why haven’t we done that? Because y’all are bad rationalists too! That’s why I’m yelling at y’all to fix you!
Like, I’m mostly optimistic about getting a few individuals to not do Crappy Epistemics, whereas I feel like you’re targeting groups, which seems difficult if I’m one of very few people who get what you’re saying.
To back up, I’m working on BCIs so we can have actual superintelligent humans, not AIs. Enough to do a pivotal act, no more.
I do think that human alignment is important for having a hope at aligning things bigger and more intelligent/powerful than ourselves
If you’re saying “we need dramatically better instrumental rationality, of which short-term optimization targets are a big component” then yes, strong agree. I feel like you’re saying something else though, maybe about coordination between humans?
Like, there’s a big “I’m outside the system!” type error, which systematically screws up control attempts because they don’t take into account the inside-the-systemness and attempt to align “them” instead of “us, starting with me”—two boxing AI alignment, basically.
I have exactly one friend whom I consider a rationalist, so I haven’t interacted with enough groups to comment here.
It sounds like maybe you’re on a similar track?
My immediate trajectory is “improve my rationality (including self-alignment) while working on funding for BCI research so a cohort of aligned humans can partially apotheosize and end the acute AGI risk window”. Rationality via self-alignment is instrumentally useful for rolling an FRO and doing novel research, but the positive externalities for value alignment aren’t currently my main interest.
Ask a thermostat why the room is too cold, and the only answer it has is “Because I haven’t added enough FIRE!”.
(Noting that it took a moment for me to connect the analogy, but including it definitely helped.)
My confusion is about how you are engineering around the model’s confusion in a way which predictably generalizes at all.
Like, any task requires you to reason about a chain of instrumental decisions, and you’re engineering risk aversion… into the entire chain?
Every single inference step requires reasoning under uncertainty, and which steps you’re risk-averse about are not going to line up in a neat and actionable way. This holds even in cases where the model’s ontology is much more similar to yours, because it is thinking more complex thoughts than you.
Your math treats risk, and probabilities in general, as something which can be exposed to a single discounting term, but RLAIF-augmented human oversight isn’t enough to overcome this.
To restate myself from earlier, “uncertainty about risk” is mathematically identical to “risk” and also “uncertainty about uncertainty about risk” etc. and your model blows up when presented with this.
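One way to spell out the “mathematically identical” part, as a sketch under the assumption that the agent is Bayesian and has a well-defined distribution at every level. The symbols $R$ (the bad outcome), $p$ (its first-order probability), $\mu$ (the agent’s distribution over $p$), and $\nu$ (a distribution over $\mu$) are mine, introduced only for illustration:

```latex
% Uncertainty about the probability of R collapses to a single probability:
\[
  P(R) \;=\; \int_0^1 p \,\mathrm{d}\mu(p) \;=\; \mathbb{E}_{p \sim \mu}[\,p\,].
\]
% Adding another level (uncertainty about \mu itself) collapses the same way:
\[
  P(R) \;=\; \mathbb{E}_{\mu \sim \nu}\!\big[\, \mathbb{E}_{p \sim \mu}[\,p\,] \,\big].
\]
```

However many levels get stacked, the agent acts on one number, so there is no principled boundary where “risk” ends and “probability of risk” begins for a discount term to attach to.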
(I’m not confidently saying that this shouldn’t be tried, but my median estimate of the difficulty of alignment goes down from “deriving algebraic geometry as a pre-agricultural human” to “doing the Apollo mission without transistors in 1960s America”. And I’m also heuristically worried about risk-aversion causing s-risks, but don’t have a strong argument for why that would occur, nor is that class of heuristics substantially influencing my thoughts on the math not applying here.)