Co-founder and CEO of quiver.trade. Interested in mechanism design and neuroscience. Hopes to contribute to AI alignment.
Twitter: https://twitter.com/azsantosk
I am aware of Reinforcement Learning (I am actually sitting right next to Sutton’s book on the subject, which I have read in full), but I think you are right that my point is not very clear.
The way I see it, RL goals are really only the goals of the base optimizer. The agents themselves are either not intelligent (they follow simple procedural ‘policies’) or are mesa-optimizers that may learn to pursue something else entirely (proxies, etc.). I updated the text; let me know if it makes more sense now.
You are right; I should have written that the AGI will “correct” its biases rather than that it will “remove” them.
I am still confused about these topics. We know that any behavior can be expressed as a complicated world-history utility function, and that therefore anything at all could be rational according to some such function. So I sometimes think of rationality as a spectrum: the simpler the utility function justifying your actions, the more rational you are. According to such a definition, rationality at the highest end may actually be opposed to human values, so it makes a lot of sense to focus on intelligence that is not fully rational.
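To make that spectrum a bit more concrete, here is the rough (and certainly non-standard) way I picture it, writing K(·) for Kolmogorov complexity; this is only a sketch of my intuition, not an established definition:

```latex
% Rough, non-standard formalization of the "rationality spectrum" above:
% an agent with policy \pi counts as more rational the simpler the simplest
% utility function U over world-histories under which \pi is optimal.
\mathrm{rationality}(\pi) \;\propto\;
  \frac{1}{\min\bigl\{\, K(U) \;:\; \pi \in \operatorname*{arg\,max}_{\pi'} \mathbb{E}_{\pi'}\,[\,U(\text{world history})\,] \,\bigr\}}
```

Under this reading, maximally “rational” agents are the ones whose behavior compresses down to very simple utility functions, which is exactly the regime that seems most at odds with complex human values.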
I’m not really sure what you mean by a “honing epistemics” kind of rationality, but I understand that moral uncertainty from the perspective of the AGI may increase the chance that it keeps some small fraction of the universe for us, so that would also be great. Is that what you mean? I don’t think it is going to be easy to have the AGI consider some phenomena as outside its scope (such that it would be irrational to meddle with them). If we want the AGI to leave us alone, then this is a value that we need to include in its utility function somehow.
Utility function evolution is complicated. I worry a lot about it, particularly because this seems to be one of the ways to achieve corrigibility, and we really want that, but it also looks like a violation of goal integrity from the perspective of the AGI. Maybe it is possible for the AGI to consider this “module” responsible for giving feedback to itself as part of itself, just as we (usually) consider our midbrain and other evolutionarily ancient “subcortical” areas as a part of us rather than some “other” system interfering with our higher goals.
I agree. Regarding biases that I would like to throw away one day in the future (while being careful enough to protect modules important for self-preservation and self-healing), I’d probably like to discard excessive energy-preserving modules, such as the ones responsible for laziness, which are only really useful in ancestral environments where food is scarce.
I like your example of the senseless winter bias as well. There are probably many examples like that.
What we think is that we might someday build an AI advanced enough that it can, by itself, devise plans for a given goal X and execute them. Is this that otherworldly? Given current progress, I don’t think so.
I don’t think so either. AGIs will likely be capable of understanding what we mean by X and making plans for exactly that if they want to help. The problem is that the AGIs may have other goals in mind by then.
As for reinforcement learning, even if it now seems impossible to build AGIs with utility functions in that paradigm, nothing assures us that it will be the paradigm used to build the first AGI.
Sure, it may be possible that some other paradigm gives us more control over the utility functions. User tailcalled mentioned John Wentworth’s research (which I will proceed to study, as I haven’t yet done so in depth).
(Unless the first AGI can’t be told to do anything at all, but then we would already have lost the control problem.)
I’m afraid that this may be quite a likely outcome if we don’t make much progress in alignment research.
Regarding what the AGI will want then, I expect it to depend a lot on the training regime and on its internal motivation modules (somewhat analogous to the subcortical areas of the brain). My threat model is quite similar to the one defended by Steven Byrnes in articles such as this one.
In particular, I think AI developers will likely give the AGI “creativity modules” responsible for generating intrinsic reward whenever it finds interesting patterns or abilities. This will help the AGI stay motivated and keep learning to solve harder and harder problems when external reward is sparse, which I predict will be extremely useful for making the AGI more capable. But I expect the internalization of such intrinsic rewards to end up generating utility functions that are nearly unbounded in the value assigned to knowledge and computational power, and quite possibly hostile to us.
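To make the kind of mechanism I have in mind more concrete, here is a minimal toy sketch (my own illustration, with made-up names like CuriosityBonus, not anyone’s actual design): the agent receives an intrinsic reward proportional to the prediction error of a simple learned model of its observations, so that discovering new patterns remains rewarding even when external reward is sparse.

```python
import numpy as np

class CuriosityBonus:
    """Toy intrinsic reward: prediction error of a linear next-observation model."""

    def __init__(self, obs_dim, lr=0.01, scale=1.0):
        self.W = np.zeros((obs_dim, obs_dim))  # linear predictor obs_t -> obs_{t+1}
        self.lr = lr
        self.scale = scale

    def reward(self, obs, next_obs):
        pred = self.W @ obs
        error = next_obs - pred
        # Online update: familiar transitions become predictable and stop paying out,
        # pushing the agent to keep looking for new patterns.
        self.W += self.lr * np.outer(error, obs)
        return self.scale * float(np.mean(error ** 2))

# Hypothetical training loop: total_reward = external_reward + bonus.reward(obs, next_obs)
```

The worry in the paragraph above is precisely that an AGI which internalizes this kind of signal ends up valuing new knowledge and more computation without bound.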
I don’t think all is lost, though. Our brain provides us with an example of a relatively well-aligned intelligence: our own higher reasoning in the telencephalon seems reasonably well aligned with the evolutionarily ancient, primitive subcortical modules (though not so much with evolution’s base objective of reproduction). I’m not sure how much work evolution had to do to align these two modules. I’ve heard at least one person argue that maybe higher intelligence didn’t evolve earlier because of the difficulty of aligning it. If so, that would be pretty bad.
Also, I’m somewhat more optimistic than others about the prospect of creating myopic AGIs that strongly crave short-term rewards that we control. I think it might be possible (with a lot of effort) to keep such an AGI contained in a box even if it is more intelligent than humans in general, and that such an AGI may help us with the overall control problem.
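For concreteness, by “myopic” I mean something close to the γ = 0 limit of the usual discounted objective, so the agent cares only about the next reward we hand out rather than about long-horizon returns:

```latex
% Standard discounted return versus the myopic (gamma = 0) special case I have in mind.
J_{\gamma}(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{k \ge 0} \gamma^{k}\, r_{t+k+1}\right],
\qquad
J_{\text{myopic}}(\pi) = \mathbb{E}_{\pi}\!\left[\, r_{t+1} \,\right] \quad (\gamma = 0).
```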
I agree my conception is unusual, and I am ready to abandon it in favor of some better definition. At the same time, I feel that a utility function with way too many components becomes useless as a concept.
Because here I’m trying to derive the utility from the actions, I feel that the less information is required to encode a being’s utility function, in a Kolmogorov-complexity sense, the better we can understand it; and if the function is too complex, then there is no good explanation for the actions and we conclude the agent is acting somewhat randomly.
Maybe trying to derive the utility as a ‘compression’ of the actions is where the problem is, and I should distinguish more carefully between what the agent does and what the agent wants. An agent is then irrational only if its wants are inconsistent with each other; if its actions are inconsistent with what it wants, then it is merely incompetent, which is something else.
That is, that we shouldn’t worry so much about what to tell the genie in the lamp, because we probably won’t even have a say to begin with.
I think you summarized it quite well, thanks! The idea written like that is clearer than what I wrote, so I’ll probably edit the article to include this claim explicitly. This really is what motivated me to write the post in the first place.
Personally I (also?) think that the right “values” and the right training is more important.
You can keep the “also”; I agree with you.
Given the current state of confusion on this matter, I think we should focus on how values might be shaped by architecture and training regimes, and try to make progress on that even if we don’t know exactly what human values are or what utility functions they represent.
Having read Steven’s post on why humans will not create AGI through a process analogous to evolution, his metaphor of the gene trying to do something felt appropriate to me.
If the “genome = code” analogy is the better one for thinking about the relationship of AGIs and brains, then the fact that the genome can steer the neocortex towards such proxy goals as salt homeostasis is very noteworthy, as a similar mechanism may give us some tools, even if limited, to steer a brain-like AGI toward goals that we would like it to have.
I think Eliezer’s comment is also important in that it explains quite eloquently how complex these goals really are, even though they seem simple to us. In particular the positive motivational valence that such brain-like systems attribute to internal mental states makes them very different from other types of world-optimizing agents that may only care about themselves for instrumental reasons.
Also the fact that we don’t have genetic fitness as a direct goal is evidence not only that evolution-like algorithms don’t do inner alignment well, but also that simple but abstract goals such as inclusive genetic fitness may be hard to install in a brain-like system. This is especially so if you agree, in the case of humans, that having genetic fitness as a direct goal, at least alongside the proxies, would probably help fitness, even in the ancestral environment.
I don’t really know how big of a problem this is. Given that our own goals are very complex and that outer alignment is hard, maybe we shouldn’t be trying to put a simple goal into an AGI to begin with.
Maybe there is a path for using these brain-like mechanisms (including positive motivational valence for imagined states and so on) to create a secure aligned AGI. Getting this answer right seems extremely important to me, and if I understand correctly, this is a key part of Steven’s research.
Of course, it is also possible that this is fundamentally unsafe and we shouldn’t do it, but somehow I think that is unlikely. It should be possible to build such systems at a smaller scale (and therefore not superintelligent) so that we can investigate their motivations to see what their internal goals are, and whether the system is treacherous or pursuing proxies. If it turns out that such a path is indeed fundamentally unsafe, I would expect this to be related to ontological crises or to profound motivational changes expected to occur as capability increases.
I think it is an interesting idea, and it may be worthwhile even if Dagon is right and it results in regulatory capture.
The reason is that regulatory capture is likely to benefit a few select companies, promoting an oligopoly. That sounds bad, and it usually is, but in this case it also reduces AI race dynamics. If there are only a few serious competitors for AGI, it is easier for them to coordinate, and it is also easier for us to influence them towards best safety practices.
In my model the Oracle would stay securely held in something like a Faraday cage with no internet connection and so on.
So yes, some people might want to steal it, but if we have some security I think they would be unlikely to succeed, unless it is a state-level effort.
I think you are right! Maybe I should have actually written different posts about each of these two plans.
And yes, I agree with you that maybe the most likely way of doing what I propose is getting someone ultra rich to back it. That idea has the advantage that it can be done immediately, without waiting for a Math AI to be available.
To me it still seems important to think about what kind of strategic advantages we can obtain with a Math AI. Maybe it is possible to gain a lot more than money (I gave the example of zero-day exploits, but we can most likely get a lot of other valuable technology as well).
While I am sure that you have the best intentions, I believe the framing of the conversation was very ill-conceived, in a way that makes it harmful, even if one agrees with the arguments contained in the post.
For example, here is the very first negative consequence you mentioned:
(bad external relations) People on your team will have a low trust and/or adversarial stance towards neighboring institutions and collaborators, and will have a hard time forming good-faith collaboration. This will alienate other institutions and make them not want to work with you or be supportive of you.
I think one can argue that, if this argument is correct, the post itself will exacerbate the problem by bringing greater awareness to these “intentions” in a very negative light.
The “intentions” keyword pattern-matches with “bad/evil intentions”. Those worried about existential risk are good people, and their intentions (preventing x-risk) are good, so we should refer to ourselves accordingly and talk about misguided plans rather than anything resembling bad intentions.
People discussing pivotal acts, including those arguing that they should not be pursued, use this expression sparingly. Moreover, they seem to use it deliberately to avoid more forceful terms. Your use of scare quotes and your direct association of this expression with bad/evil actions cast a significant part of the community in a bad light.
It is important for this community to be able to have some difficult discussions without attracting backlash from outsiders, and having specific neutral, untainted terminology serves precisely that purpose.
As others have mentioned, your preferred ‘Idea A’ has many complications, and you have not convincingly addressed them. As a result, good members of our community may well find ‘Idea B’ worth exploring despite the problems you mention. Even if you don’t think their efforts are helpful, you should be careful to portray them in a good light.
Another strong upvote for a great sequence. Social-instinct AGIs seem to me a very promising and very much overlooked approach to AGI safety. There seem to be many “tricks” “used by the genome” to build social instincts from ground values, and reverse-engineering these tricks seems particularly valuable for us. I am eagerly waiting to read the next posts.
In a previous post I shared a success model that relies on your idea of reverse-engineering the steering subsystem to build agents with motivations compatible with a safe Oracle design, including the class of reversely aligned motivations. What is your opinion on it? Do you think the set of “social instincts” we would want to incorporate into an AGI changes much if we are optimizing for reverse rather than direct intent alignment?
Hi! I’m Kelvin, 26, and I’ve been following LessWrong since 2018. I came here after reading references to Eliezer’s AI-Box experiments in Nick Bostrom’s book.
During high school I participated in a few science olympiads, including Chemistry, Math, Biology, and Informatics, and I was a reserve member of the Brazilian team for the 2012 International Chemistry Olympiad.
I studied Medicine and later Molecular Science at the University of São Paulo, and dropped out in 2015 to join a high-frequency trading fund based in Brazil. I had a successful career there and rose to become one of the senior partners.
Since 2020 I’ve been co-founder and CEO of TickSpread, a crypto futures exchange based on batch auctions. We are interested in mechanism design, conditional and combinatorial markets, and futarchy.
I’m also personally very interested in machine learning, neuroscience, and AI safety, and I’ve spent quite some time studying these topics on my own, despite having no professional experience in them.
I very much want to be more active in this community, participating in discussions and meeting other people who are interested in these topics, but I’m not totally sure where to start. I would love for someone to help me get integrated here, so if you think you can do that, please let me know :)