AI alignment researcher, ML engineer. Master’s in Neuroscience.
Maybe this is released as a pre-explanation for why GPT-4 will have to be delayed before there is public access. Something to point to when explaining why it would be bad to let everyone use it until they figure out better safety measures.
This is a very good point DragonGod. I agree that the necessary point of increasing marginal returns to cognitive reinvestment has not been convincingly (publicly) established. I fear that publishing a sufficiently convincing argument (which would likely need to include empirical evidence from functional systems) would be tantamount to handing out the recipe for this RSI AI.
Thinking about (innate drives → valenced world states → associated states → learned drives → increasingly abstract valenced empowerment) brings up for me this question of seeking a very specific world state with high predicted valence & empowerment. This, I feel, is accurately described, but awkward to think about, from the frame of Jacob’s W/P/U/V/A distinction. Like how it’s accurate but difficult to think about biology from the frame of movements of protons and electrons. I think if we zoom in on the W/P plan-making portion and adopt a different frame, we see a consequentialist plan generator that does directed search through projected futures based on W (world model). And this then is rather like Eliezer’s Outcome Pump. If you zoom out, the Outcome Pump is one part of an agent. It’s only in the zoomed-in view that you see a non-sentient search process that searches for valenced empowerment over extrapolations made from running simulations of the world model. I’d argue that something very like this planning process is occurring in AlphaZero and DeepNash (Stratego). But those have narrow world models, and search systems designed to work over narrow world models.
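To make that zoomed-in view concrete, here’s a minimal toy sketch (the world model, valence score, and action set are made-up stand-ins, not any real system) of a non-sentient search process that rolls candidate action sequences through W and keeps whichever projected future scores highest:

```python
import random

ACTIONS = ["a0", "a1", "a2"]   # hypothetical primitive actions
HORIZON = 3                    # how far ahead to project
N_ROLLOUTS = 100               # how many candidate futures to sample


def world_model(state, action):
    """Stand-in for W: deterministically predicts the next state (toy stub)."""
    return (state * 31 + sum(ord(c) for c in action)) % 1000


def valence(state):
    """Stand-in for the learned valence/empowerment score of a projected state."""
    return (state * 37) % 97 / 97.0


def plan(initial_state):
    """Directed search through projected futures; returns the best plan found."""
    best_plan, best_score = None, float("-inf")
    for _ in range(N_ROLLOUTS):
        state, candidate = initial_state, []
        for _ in range(HORIZON):
            action = random.choice(ACTIONS)     # undirected sampling here;
            candidate.append(action)            # AlphaZero-style search would bias
            state = world_model(state, action)  # this toward promising branches
        score = valence(state)
        if score > best_score:
            best_plan, best_score = candidate, score
    return best_plan, best_score


print(plan(initial_state=0))
```

The point of the sketch is just that nothing in the inner loop needs to be sentient; it is search over simulated futures, scored by valence.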
Quote from the Outcome Pump:
Consider again the Tragedy of Group Selectionism: Some early biologists asserted that group selection for low subpopulation sizes would produce individual restraint in breeding; and yet actually enforcing group selection in the laboratory produced cannibalism, especially of immature females. It’s obvious in hindsight that, given strong selection for small subpopulation sizes, cannibals will outreproduce individuals who voluntarily forego reproductive opportunities. But eating little girls is such an un-aesthetic solution that Wynne-Edwards, Allee, Brereton, and the other group-selectionists simply didn’t think of it. They only saw the solutions they would have used themselves.
So, notice this idea that humans are doing an aesthetically guided search. That seems to me an accurate description of human thought / planning. I think it has a lot of overlap with the way Stable Diffusion and other image models ‘imagine’ an aesthetically pleasing picture, and with using Energy Models to imagine plans out of noise.
Despite feeling that there are some really key points in Jacob’s ‘it all boils down to empowerment’ point of view (which is supported by the paper I linked in my other comment), I still find myself more in agreement with Steven’s points about innate drives.
(A) Most humans do X because they have an innate drive to do X; (e.g. having sex, breathing)
(B) Most humans do X because they have done X in the past and have learned from experience that doing X will eventually lead to good things (e.g. checking the weather forecast before going out)
(C) Most humans do X because they have indirectly figured out that doing X will eventually lead to good things—via either social / cultural learning, or via explicit means-end reasoning (e.g. avoiding prison, for people who have never been in prison)
So, I think a missing piece here is that ‘empowerment’ is perhaps better described as ‘ability to reach desired states’, where the desire stems from innate drives. This is a very different sense of ‘empowerment’ than a more neutral ‘ability to reach any state’ or ‘ability to reach as many states as possible’.
If I had available to me a button which, when I pressed it, would give me 100 unique new ways in which it was possible for me to choose to be tortured and the ability to activate any of those tortures at will… I wouldn’t press that button!
If there was another button that would give me 100 unique new ways to experience pleasure and the ability to activate those pleasures at will, I would be strongly tempted to press it.
Seems like my avoiding the ‘new types of torture’ button is me declining reachability / empowerment / optionality. This illustrates why I don’t think a non-valenced empowerment seeking is an accurate description of human/animal behavior.
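Here’s a toy way to see the difference (the states, valences, and buttons are invented purely for illustration, not any formal empowerment definition):

```python
# "Neutral" empowerment counts reachable states; "valenced" empowerment only
# credits states you actually want to reach. All numbers here are made up.

def neutral_empowerment(reachable_states):
    """Ability to reach any state: just the number of options."""
    return len(reachable_states)


def valenced_empowerment(reachable_states, valence):
    """Ability to reach desired states: only count positively valenced options."""
    return sum(v for s in reachable_states if (v := valence(s)) > 0)


current = ["stay home", "go for a walk"]
torture_button = current + [f"torture option {i}" for i in range(100)]
pleasure_button = current + [f"pleasure option {i}" for i in range(100)]

valence = lambda s: -1.0 if "torture" in s else 1.0

for name, states in [("current", current),
                     ("after torture button", torture_button),
                     ("after pleasure button", pleasure_button)]:
    print(name, neutral_empowerment(states), valenced_empowerment(states, valence))
```

Pressing either button adds 100 options, so neutral empowerment goes up either way; only the pleasure button raises the valenced version, which matches the choice I would actually make.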
Of course, we can learn to associate innate-drive-neutral things, like money, with innate-drive-valenced empowerment. Or even innate-drive-negative things, so long as the benefit sufficiently outweighs the cost.
And once you’ve gotten as far as ‘valenced empowerment with ability to bridge locally negative states’, then you start getting into decision making about various plans over the various conceptual directions in valenced state space (with the valence originating from, but now abstracted away from, innate drives), and this to me is very much what Shard Theory is about.
I see this paper as having some valuable insights for unifying the sort of Multi-Objective variable-strength complex valence/reward system that the neuroscience perspective describes with a need to tie these dynamically weighted objectives together into a cohesive plan of action. https://arxiv.org/abs/2211.10851 (h/t Capybasilisk)
I think if someone negatively reacts to ‘Safety’ thinking you mean ‘try to ban all guns’ instead of ‘teach good firearm safety’, you can rephrase as ‘Control’ in that context. I think Safety is more inclusive of various aspects of the problem than either ‘Control’ or ‘Alignment’, so I like it better as an encompassing term.
I agree about ‘deceive the deployed AGI into thinking it is in a simulation’ being a doomed plan. I do think that a somewhat nearby plan, where the AGI is kept entirely in the simulation from start to finish, is possibly a good idea. In this simbox plan, the AGI is used not to create or enact plans in the real world, but rather as a guinea pig for alignment experiments. Once you’ve learned enough from your experiments, build a new AGI from scratch that you think is more align-able, and try to align it in a new simbox. Only once you’ve repeated this enough times that you are confident you have an align-able architecture do you train the model on any real world data and try to use it to help you make a pivotal act. I imagine that early versions of align-able architecture will need to have strong enforceable limitations on the speed and overall intelligence and creativity of the model, as well as be highly interpretable, reasonably corrigible, etc.
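For concreteness, here’s a very rough sketch of the simbox loop I have in mind; every function below is a toy stub standing in for a large open research problem, not something that exists today:

```python
import random


def propose_architecture(lessons):
    """Stand-in for designing a new, hopefully more align-able architecture from scratch."""
    return {"version": len(lessons), "interpretable": True, "speed_capped": True}


def run_alignment_experiments(architecture, in_simbox=True):
    """Stand-in for training the guinea-pig AGI entirely inside the simbox and probing it."""
    assert in_simbox, "never run experiments on a model exposed to real-world data"
    return {"alignable_score": random.random()}


def confident_alignable(results, threshold=0.99):
    """High bar: only repeated, convincing evidence counts."""
    return results["alignable_score"] > threshold


def simbox_loop(max_iterations=100):
    lessons = []
    for _ in range(max_iterations):
        arch = propose_architecture(lessons)
        results = run_alignment_experiments(arch, in_simbox=True)
        lessons.append(results)
        if confident_alignable(results):
            return arch   # only now would real-world data and deployment even be considered
    return None           # the default outcome is "do not deploy"


print(simbox_loop())
```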
This plan has weak points, but I think they are different from the ones mentioned in this post. The weak points I see with the simbox-guinea-pig plan are:
How sure can you be that the results of the experiments you are running are not being ‘thrown’ by an AGI aware that it is in a simbox?
Can your interpretability methods detect whether the current run of the AGI has noticed that it is in a simbox yet?
How can you monitor the experiments without exposing yourself to infohazards?
Can you keep the AGI weak enough to not break out or pose infohazard danger while still being strong enough that the results of the experiments are relevant? (I assume there will be emergent capabilities that need to be accounted for in the realm of powerful-enough-to-be-useful-for-a-pivotal-act).
Can you set up backup systems such that failures will be setbacks but not catastrophes?
So, the simbox-guinea-pig plan seems hard, and with many open unsolved questions, but not necessarily doomed.
Additionally, I think that the problems you aim an AGI at could matter a lot for how dangerous the responses you get back are. Some proposed solutions to posed problems will, by their nature, be much safer and easier to verify. For example, a plan for a new widget that can be built and tested under controlled lab conditions, and explored thoroughly for booby traps or side effects. That’s a lot safer than a plan to, say, make a target individual the next president of the United States.
Edit: I forgot to add what I currently think of as the biggest weakness of this plan: the very high alignment tax. How do you convince the inventor of this AI system not to deploy it, and instead run lots of expensive testing on it and only deploy a safer version that might not be achieved for years yet? I think this is where demonstrative bad-behavior-prompting safety evaluations could help, by showing when a model was dangerous enough to warrant such a high level of caution.
I agree with the point that TekhneMaker makes: we must not count on efficiency improvements in running intelligence being hard to find. They might be, but I expect they won’t be. Instead, my hope is that we can find ways of deliberately limiting the capabilities and improvement rate / learning rate of ML models such that we can keep an AGI constrained within a testing environment. I think this is an easier goal, and more under our control, than hoping intelligence will remain compute-intensive.
The groups evolve every time someone answers a question. For a little while there were five separate groups, then it merged back to two! Fascinating to see. I hope more people join in to ask and answer questions.
I’ve added a few questions, and I’ll be excited to see what others add.
I think there’s actually a pretty good solution to this that I’ve been looking forward to for years, and that I think the technology is now ready to support. Unfortunately, I’m too busy worrying about, I mean, working on AGI alignment to spend time on this currently. The idea is simple: an open-source decentralized recommender/search engine that you keep a local copy of, sharing your recommendations of search results over a decentralized social network like Mastodon. Like a combination of Google Search, StumbleUpon, and Reddit. Search results get upvoted or downvoted, and ones that were actually helpful to friends of your friends will be ranked higher for you, and even more so for friends of friends who fit a similar pattern of web usage. A variety of specific implementations of this have been suggested. Here’s an example: https://www.researchgate.net/publication/260652116_A_Collaborative_Decentralized_Approach_to_Web_Search
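To give a flavor of the friends-of-friends ranking (the follow graph, votes, and distance-discounting below are invented purely for illustration):

```python
from collections import defaultdict

# who follows whom (e.g. over a Mastodon-like network)
follows = {"me": {"alice", "bob"}, "alice": {"carol"}, "bob": set(), "carol": set()}

# upvotes (+1) / downvotes (-1) that each user shared for a search-result URL
votes = {
    "alice": {"example.org/tutorial": +1},
    "bob":   {"example.org/spam": -1},
    "carol": {"example.org/tutorial": +1, "example.org/deep-dive": +1},
}


def social_distance(src, dst, graph, max_hops=3):
    """Breadth-first hop count from src to dst; None if unreachable within max_hops."""
    frontier, seen = {src}, {src}
    for hops in range(1, max_hops + 1):
        frontier = {n for f in frontier for n in graph.get(f, set())} - seen
        if dst in frontier:
            return hops
        seen |= frontier
    return None


def rank_results(user, candidate_urls):
    """Score each URL by votes from the user's network, discounted by social distance."""
    scores = defaultdict(float)
    for voter, their_votes in votes.items():
        d = social_distance(user, voter, follows)
        if d is None:
            continue
        for url, vote in their_votes.items():
            if url in candidate_urls:
                scores[url] += vote / d   # closer friends count for more
    return sorted(candidate_urls, key=lambda u: -scores[u])


print(rank_results("me", ["example.org/tutorial", "example.org/spam", "example.org/deep-dive"]))
```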
This is along the general theme of ‘we ought to take back ownership of our data and our internet rather than leaving it in the hands of corporate giants to mediate for us’.
I think there are some problems we get into when thinking about superintelligences. Trying to have humans control a superintelligence is probably a much harder problem than having humans control a super-human system made of non-superhuman components. We just need a way to bridge ourselves to the long reflection. I think it’s plausible that a large corporation / think-tank staffed by human-genius-level AIs running at 1000x human speed could be powerful enough. There’s no need to jump to “an intelligence so far beyond us it is like we are rats trying to communicate with human geniuses”.
Fascinated by the idea that you might get better resistance to mode collapse if you were using an alternative loss function during fine tuning to avoid gradient starvation. https://arxiv.org/abs/2011.09468
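If I’m reading the paper right, their proposed fix (Spectral Decoupling) amounts to penalizing the L2 norm of the logits in place of weight decay. A minimal sketch, with an illustrative hyperparameter value rather than a recommendation:

```python
import torch
import torch.nn.functional as F


def spectral_decoupling_loss(logits, targets, sd_lambda=0.01):
    """Cross-entropy plus an L2 penalty on the logits themselves (not the weights)."""
    ce = F.cross_entropy(logits, targets)
    penalty = (logits ** 2).mean()
    return ce + 0.5 * sd_lambda * penalty


# dummy data just to show the call; in fine-tuning these would come from the model
logits = torch.randn(8, 5)            # batch of 8, 5 classes
targets = torch.randint(0, 5, (8,))
print(spectral_decoupling_loss(logits, targets))
```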
I definitely intend to give my future child a lot of independence and outdoor roaming experience. I loved exploring natural spaces as a kid, and wandered far and wide. At four I happily climbed trees three or four stories tall and looked down at all the one-story buildings around. I loved volunteering at the historical farm near my home, working with the draft horses, milking the cows, playing on the haystacks in the barn. I have fond memories of my grandpa teaching me to drive his small tractor on his little farm when I was around 9. My younger brother was somewhat less adventurous than me as a child, and has been rather a helicopter parent with his 3 year old so far. Hopefully he’ll lighten up a bit over the next couple years.
There is certainly some trade-off point though. I also had friends who were even more adventurous than me as teenagers, and some of them came to unfortunate ends. I think that being more free to explore when I was younger helped teach me caution and a sense of my own limits that served me well in my teenage years.
I think a big aspect of ingrained behaviors like love is about shaping the behaviors of the evolved agents to ‘higher average utility behavior patterns’. For love of non-kin / non-mates, this is probably something like “this social organism will tend to do better when it chooses to remain in close social proximity to allies.” So it’s not just about gaining the trust and assistance of community members, it’s also about making sacrifices to stay near enough to them that they could help you if you found yourself in need. Community building is about finding ways of credibly signalling contracts of mutual support, and building histories of repeated instances of mutual support. Adversity can bring a group closer together by giving them the chance to demonstrate supportive behaviors for each other, increasing the expectation that such will happen again in the future.
I mostly agree with you here, so I’m not the best candidate to write a convincing story that routes around these problems. I think there is some glimmer of hope shining through a narrow sliver of space here though. Start with the premise that you have a powerful plan-making AGI, and set it to making plans for pivotal acts, or at least plans for maneuvering you closer to making pivotal acts. Insist that the plan be human-readable and intended to be carried out by human actors. Have your powerful AGI, or other narrow AGIs, audit the plan as a whole and in parts. You can add small amounts of noise to the plan (e.g. switching words for synonyms) to avoid steganography. You can test out the auditing process itself with plans not written by the AGI.
You can wipe the memories of the plan from the AGI and modify parts of it with noise to make it a similar (rather than identical) agent. You can fine tune versions of the AGI on different datasets with different emphases, again to make it a slightly different agent. You can test it out with smaller, safer plans. Then test it out with smaller legible steps towards maneuvering into position for a pivotal act (e.g. gaining money, getting a chosen candidate elected).
Hopefully by making a number of human-legible and human-enacted steps towards getting ready to make a pivotal act, you can reduce the complexity of a plan necessary to go the rest of the way. This also lets you use a less-powerful AGI to make the plan. The less super-humanly intelligent the AGI is, the more likely you are to be able to avoid failure of control or running into an undetected deceptive turn.
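Going back to the synonym-noise auditing step, here’s a toy illustration (the synonym table and the auditor are stubs; a real pipeline would need much stronger paraphrasing and much better auditors):

```python
import random

SYNONYMS = {"acquire": ["obtain", "get"], "funds": ["money", "capital"],
            "candidate": ["nominee"], "election": ["vote", "ballot"]}


def add_synonym_noise(plan_text, swap_prob=0.3, rng=random):
    """Randomly swap some words for synonyms to disrupt hidden encodings in the wording."""
    words = plan_text.split()
    noisy = [rng.choice(SYNONYMS[w]) if w in SYNONYMS and rng.random() < swap_prob else w
             for w in words]
    return " ".join(noisy)


def audit(plan_text):
    """Toy auditor stub: flags plans containing a forbidden step."""
    return "self-replicate" not in plan_text


def audited_ok(plan_text, n_copies=5):
    """Plan passes only if every independently perturbed copy passes audit."""
    return all(audit(add_synonym_noise(plan_text)) for _ in range(n_copies))


print(audited_ok("acquire funds and support the chosen candidate in the election"))
```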
There are downsides with this approach. One big one is that it would probably take much more time to enact, and possibly have lower odds of success. Another is that there are a lot of potential infohazard risks to tiptoe around when trying to get human-readable plans out of an otherwise-boxed AGI (extracting a plan is an opening in the box, but hopefully could be done in a controlled safe way).
Just a random thought. I was wondering if it was possible to make a better laser keyboard by having an actual physical keyboard consisting of a mat with highly reflective background and embossed letters & key borders. This would give at least some tactile feedback of touching a key. Also it would give the laser and sensors a consistent environment on which to do their detection allowing for more precise engineering. You could use an infrared laser since you wouldn’t need its projection to make the keyboard visible, and you could use multiple emitters and detectors around the sides of the keyboard. The big downside remaining would be that you’d have to get used to hovering your fingers over the keyboard and having even the slightest touch count as a press.
Yeah, good point. Some balance needs to be struck in such a scenario where people are given the power to do some sort of customization, but not so much that they can warp the intended purpose of the model into being directly harmful or helping them be harmful.
An analogy I like for powerful narrow AI is that I’d hope it could be like public transit. A dedicated driver and maintenance crew make sure it operates safely, but many people get to interact with it and benefit from it. That doesn’t mean unauthorized members of the public get to modify or control it.
I think we’re going to need a system of responsibly managed narrow AIs as public goods in the future. This seems like a much easier problem to tackle than AGI alignment, but still an interesting challenge. It seems to me like email (and its spam filter) should be a public good like the post office. I don’t really trust my current state or federal government (USA) to safely and effectively manage an official email account for me though.
I’ve been thinking lately about how a positive vision of the future for me would likely involve giving everyone access to some sort of electronic assistant who its user could trust with information about their needs and get good feedback (general life advice, medical advice, career advice, counseling, finding and bringing to attention useful information). Making a system worthy of that trust seems difficult, but very beneficial if it could be implemented well.
“As an analogy, the genome specifies some details about how the lungs grow—but lung growth depends on environmental regularities such as the existence of oxygen and nitrogen at certain concentrations and pressures in the atmosphere; without those gasses, lungs don’t grow right. Does that mean the lungs ‘learn’ their structure from atmosphere gasses rather than just from the information in the genome? I think that would be a peculiar way to look at it.” —Geoffrey Miller
I wanted to highlight this quote because I like this analogy and feel like there’s a very valuable insight here. We need to keep in mind as we evaluate pre-trained machine learning models that we are evaluating a thing which is a result of the interaction between a learning algorithm and a dataset. Sometimes the nature of this dataset and how it shapes the result gets overlooked.