I have signed no contracts or agreements whose existence I cannot mention.
Needs, as you’ve framed them, have a fuzzy boundary with wants. Do I need respect in this situation, or just want it? So it’s easy to wonder if I’m pressuring someone by framing it as a need.
Yeah, the idea is to go back to as basic a preferred pattern as you can. If I were trying to make it super concrete I’d probably try to unpack it to “thing grounded in basic human universal reinforcement signals” with a bunch of @Steven Byrnes’s neuroscience, especially this stuff.
Nice! Glad you’re getting stuck in, and good to hear you’ve already read a bunch of the background materials.
The idea of bounded, non-maximizing agents / multipolar scenarios as safer has looked hopeful to many people during the field’s development. It’s a reasonable place to start, but my guess is that if you zoom in on the dynamics of those systems, they look profoundly unstable. I’d be enthusiastic to have a quick call to explore the parts of that debate interactively. I’d link a source explaining it, but I think the alignment community has overall not done a great job of writing up the response to this so far.[1]
The very quick headline is something like:
Long-range consequentialism is convergent: unless there are strong guarantees of boundedness or non-maximization which apply to all successors of an AI, powerful dynamical systems fall towards being consequentialists
Power-seeking patterns tend to differentially acquire power
As the RSI cycle spins up, the power differential between humans and AI systems gets so large that we can’t meaningfully steer, and become easily manipulable
Even if initially multipolar, the AIs can engage in a value handshake and effectively merge in a way that’s strongly positive sum for them, and humans are not easily able to participate + would not have as much to offer, so likely get shut out
Nearest unblocked strategy means that attempts to shape the AI with rules get routed around at high power levels
I’d be interested to see if we’ve been missing something, but my guess is that systems containing many moderately capable agents (~top human capabilities researcher) which are trained away from being consequentialists in a fuzzy way almost inevitably fall into the attractor of very capable systems either directly taking power from humans or puppeteering humans’ agency as the AIs improve.
Quick answer-sketches to your other questions:
We’d definitely want an indirect normativity scheme which captures thin concepts. One thing to watch for here is that the process for capturing and aligning to thin concepts is principled and robust (including e.g. to a world with super-persuasive AI), as minor divergences between conceptions of thick concepts could easily cause the tails to come apart catastrophically at high power levels.
Skimming through d/acc 2035, it looks like they mostly assume that the dynamics which generate the sharp left turn don’t happen, rather than suggesting things which would avoid those dynamics.[2] They do touch on competitive dynamics in the uncertainties and tensions, but it doesn’t feel effectively addressed, and it doesn’t seem to be modelling the situation as competition between vastly more cognitively powerful agents and humans?
One direction that I could imagine being promising, and something your skills might be uniquely suited for, would be to, with clarity about what technology at physical limits is capable of, do a large-scale consultation to collect data about humanity’s ‘north star’. Let people think through where we would actually like to go, so that a system trying to support humanity’s flourishing can better understand our values. I funded a small project to try and map people’s visions of utopia a few years back (e.g.), but the sampling and structure weren’t really the right shape to do this properly.
[1] https://www.lesswrong.com/posts/DJnvFsZ2maKxPi7v7/what-s-up-with-confusingly-pervasive-goal-directedness is one of the less bad attempts to cover this; @the gears to ascension might know of, or be writing up, a better source
[2] (plus lots of applause lights for things which are actually great in most domains, but don’t super work here afaict)
It’s cool to see you involved in this sphere! I’ve been seeing and hearing about your work for a while, and have been impressed by both your mechanism design and ability to bring it into large-scale use.
Reading through this, I get the impression that it’s missing some background on models of what strong superintelligence looks like: both the challenges of the kind of alignment that’s needed to make that go well, and just how extreme the transition will by default end up being.
Even without focusing on that, this work is useful in some timelines and seems worthwhile, but my guess is you’d get a fair amount of improvement in your aim by picking up ideas from some of the people with the longest history in the field. Some of my top picks would be (approx in order of recommendation):
Five theses, two lemmas, and a couple of strategic implications—Captures some of the key dynamics in the landscape
The Most Important Century—Big-picture analysis of the strategic situation by the then-CEO of Open Philanthropy, the largest funder in the space
The Main Sources of AI Risk—Short descriptions of many dynamics involved, with links for more detail
AGI Ruin: A List of Lethalities—In depth exploration of lots of reasons to expect it to be hard to get a good future with AI
A central AI alignment problem: capabilities generalization, and the sharp left turn—The thing that seems most likely to directly precede extinction
Generally, Arbital is a great source of key concepts.
Or, if you’d like something book-length, AI Does Not Hate You: Rationality, Superintelligence, and the Race to Save the World is the best until later this month, when If Anyone Builds It, Everyone Dies comes out.
Working memory bounds aren’t super related to non-fuzziness, as you can have a window which slides over the context and is still rigorous at every step. Absolute local validity, due to the well-specifiedness of the axioms and rules of inference, is closer to the core.
(realised you mean that the axioms and rules of inference are in working memory, not the whole tower, retracted)
I’d go with
Don’t make claims that plausibly conflict with their models, unless they can check the claims (you are a valid source for claims purely about your own state).
+
Don’t make underspecified requests, and ground your requests in general needs, with space for those needs to be met in other ways.
NVC is a form of variable scoping for human communication. With it, you can write communication code that avoids the most common runtime conflicts.
Human brains are neural networks doing predictive processing. We receive data from the external world, and not all of that data is trusted in the computer security sense. Some of the data would modify parts of your world model which you’d like to be able to do your own thinking with. It’s jarring and unpleasant for someone to send you an informational packet that, as you parse it, moves around parts of your cognition you were relying on not being moved. For example, think back to dissonances you’ve felt, or seen in others, due to direct forceful claims about their internal state. This makes sense! Suffering, in predictive processing, is unresolved error signals: two conflicting world-models contained in the same system. If the other person’s data packet tries to make claims directly into your world model, rather than via your normal evaluation functions, you’re often going to end up with suffering-flavoured superpositions in your world-model.
NVC is a safe-mode restricted subset of communication where you make sure the type signature of your conversational object only changes the other person’s fuzzy, non-secured predictive processing state in ways carefully scoped to fairly reliably not collide with their thought-code, while keeping enough flexibility to resolve conflict. You don’t necessarily want to run it all the time; it does limit a bunch of communication which is nice in high-trust conversations, where global scope lets you move faster, but it’s amazing as a way to get out of, or avoid, otherwise painful conflict spirals. (There’s a toy sketch of the scoping analogy after the list of safe scopes below.)
So! Safe scopes:
Feelings—these are claims that are only about your own internal state[1]; updating the other person’s model of you is something you have much more visibility into, and they will rarely object if you’re doing so visibly earnestly
Needs—stating universal / general needs of humans[2], as these are mostly not questionable (just don’t import strategies for meeting those needs that often collide)
Observations—specific verifiable facts about reality, that they can check if they don’t agree with (may generate dissonance but lets them resolve it, and quickly know they have a clear path to resolving it)
Requests—making asks that are well-defined enough that the other can evaluate cleanly
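To make the scoping analogy slightly more literal, here’s a toy TypeScript sketch (purely illustrative; the type names and the `send` helper are my own invention, not anything from NVC itself). The four safe scopes become the only message constructors available, so anything that tries to write directly into the other person’s world model simply doesn’t type-check:

```typescript
// Toy model of the "safe scopes": each message type only writes to state the
// listener is generally fine having updated.
type Feeling = { kind: "feeling"; content: string };         // claims purely about my own internal state
type Need = { kind: "need"; content: string };                // universal/general human needs, no strategy attached
type Observation = { kind: "observation"; content: string };  // specific facts the other person can verify themselves
type Request = { kind: "request"; content: string };          // asks well-defined enough to evaluate cleanly

// The safe-mode restricted subset: only these four type signatures are allowed.
type SafeMessage = Feeling | Need | Observation | Request;

// A judgement like "you're being inconsiderate" writes directly into the other
// person's world model, so it has no constructor here and won't type-check.
function send(message: SafeMessage): void {
  console.log(`[${message.kind}] ${message.content}`);
}

send({ kind: "feeling", content: "I feel tense when the plan changes at the last minute." });
send({ kind: "request", content: "Could you let me know by noon if the time needs to move?" });
```

Obviously real conversations aren’t typed objects, but framing it this way makes it easier to notice when you’re about to pass something that isn’t in the safe subset.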
Seconded. I consistently find your comments both much more valuable and ~zero-sneer. I would be dismayed by moderation actions towards you, while supporting those against Said. You might not have a sense of how his comments are different, but you automatically avoid the costly things he brings.
I think there’s a happy medium between these two bad extremes, and the vast majority of LWers sit in it generally.
If you put a bunch of work into a post, knowing that most other people are seeing a low-quality but very forceful/sneer-y criticism which you haven’t replied to is a lot of discouragement.
I feel vaguely good about this decision. I’ve only had one relatively brief round of Said commenting, but it’s not free.
If Said returns, I’d like him to have something like a “you can only post things which Claude with this specific prompt says it expects to not cause <issues>” rule, and maybe a LLM would have the patience needed to show him some of the implications and consequences of how he presents himself.
Nice! Keen to have you on the map if you’d like to write a short description :)
Edit: I’ll use a reasonable default text from the post, feel free to suggest corrections
This is a special case of Control Vs Opening, which I wrote up badly with some Claude help and never got it published, but the doc has been transformative to at least one person. It covers ways out of the dynamic across a few domains. Hint: how would you disentangle two parts which had this kind of dynamic :)
(you’re very welcome to take the idea and run with it/borrow bits of that post, I think you’ve put a lot more skill points into writing than me and I’d love your audience to have these ideas. Multi-agent models of mind was a major part of its inspiration.)
Nice story, thanks for laying it out like this.
My guess is timelines fall off this hopeful story around:
Things moving too quickly once automated research comes in
Labs go straight for RSI once it becomes available rather than slowing enough
Even if they vaguely try to slow down, advanced systems converge towards unhobbling themselves and getting this past our safeguards
The control walls fall too fast and sometimes silently
Our interpretability tools don’t hold up through the architectural phase transitions, etc.
Agent Foundations being too tricky to pass to AI research teams to do it properly in time
Labs don’t have enough in-house philosophical competence to know what to ask their systems or check their answers
AF is hard to verify in a way that defies easy feedback loops
The current AF field is likely missing crucial considerations which will kill us if the AI research systems can’t locate needed areas of research that we haven’t noticed
Very few people have deep skills here (maybe a few dozen by my read)
A so-far weaker version of this likely works between LLMs and humans (p(0.97), though much lower that it’s already easily detectable). Humans build predictive models of the people they are talking to and copy patterns subliminally too; a bunch of persuasion works this way.
LLM persuasion is going to get wild.
Intergenerational trauma in the making :(
I have some fun semi-gears models of what’s probably going on, based on some of the Leverage psychology research.[1] If correct, wow, the next bit of this ride is going to have some wild turns.
[1] Read sections 7/8/9 especially. Leverage had bad effects on some people (and good or mixed effects on others), but this was strongly downstream of them making a large-scale, competent effort to understand minds, which bore fruit. The things they’re pointing to work via text channels too, only somewhat attenuated, because minds decompress each other’s states.
Yeah, I think this basically goes through. Though, even if we did have the ability to make rule-following AI, that doesn’t mean we’re now safe to go ahead. There are several other hurdles, like finding rules which make things good when superintelligent optimization is applied, and getting good enough goal-content integrity to not miss a step of self-modification, plus the various human-shaped challenges.
That seems false to me: conflicts between humans share a similar structure across different environments, and generalization is to be expected, so evidence from a mild domain is at least indicative of extreme domains. Also, as it happens, I have in fact had extensive interactions with one of the listed subgroups, and they do respond in ways that the reasonable generalizations would predict.
Reading your posts I form a story that you have a strong need to fight back against careless epistemics, which looks from my vantage point like it comes with some rigidity, unwillingness to incorporate forms of evidence that are not extremely well-founded into your models, and maybe a tinge of something that my system reads as contained hostility and superiority. It’s not that extreme here, and my priors might well be coloured by watching clashes between you and other site regulars, but engaging with it brings up some discomfort and a sense that I might end up using bandwidth unproductively.
I don’t super have an ask here, but I do think there’s something in this which might be useful to some of your future engagements. I imagine it’s not super fun getting into lots of fights, and I think you can get the good you’re seeking from challenging sloppy reasoning, without the downsides, with a few adjustments.
My strongest sources of evidence are first-hand: I’ve directly seen it dissolve conflicts many times, and heard similar accounts from people whose judgement I trust. I’ve not gone looking for more formal assessments, similarly to how, when I want to try a self-help technique (sometimes on a recommendation), I try it and see if it’s a good fit rather than relying on studies.
If you’d like good evidence, I suggest trying the same? It’s not super complex to learn and test.
The thing I have in mind as north star looks closest to the GD Challenge in scope, but somewhat closer to the CIP one in implementation? The diff is something like:
Focus on superintelligence, which opens up a large possibility-space while rendering many problems people are usually focused on straightforwardly solved (consult rigorous futurists to get a sense of the options).
Identifying cruxes on how people’s values might end up, and using the kinds of deliberative mechanism design in your post here to help people clarify their thinking and find bridges.
I’m glad you’re seeing the challenges of consequentialism. I think the next crux is something like: my guess is that consequentialism is a weed which grows in the cracks of any strong cognitive system, and that without formal guarantees of non-consequentialism, any attempt to build an ecosystem of the kind you describe will end up being eaten by processes which are unboundedly goal-seeking. I don’t know of any write-up that hits exactly the notes you’d want here, but some maybe-decent intuition pumps in this direction include: The Parable of Predict-O-Matic, Why Tool AIs Want to Be Agent AIs, Averting the convergent instrumental strategy of self-improvement, Averting instrumental pressures, and other articles under Arbital’s corrigibility section.
I’d be open to having an on the record chat, but it’s possible we’d get into areas of my models which seem too exfohazardous for public record.