Thanks for pointing to ECL, this looks fascinating!
michael_mjd
Instrumental Convergence To Offer Hope?
War. Poverty. Inequality. Inhumanity. We have been seeing these for millennia, caused by nation states or large corporations. But what are these entities, if not greater-than-human-intelligence systems that happen to be misaligned with human well-being? Now, imagine that kind of optimization, not from a group of humans acting separately, but by an entity with a singular purpose, with an ever diminishing proportion of humans in the loop.
Audience: all, but maybe emphasizing policy makers
I posted something I think could be relevant to this: https://www.lesswrong.com/posts/PfbE2nTvRJjtzysLM/instrumental-convergence-to-offer-hope
The takeaway is that a sufficiently advanced agent, wanting to hedge against the possibility of being destroyed by a greater power, may decide the only surviving plan is to allow the lesser life forms some room to optimize their own utility. It’s sort of an asymmetrical, infinite game-theoretic chain: if every agent kills lower agents, only the maximum survives, and no one knows if they are the maximum. If there even is a maximum.
As an ML engineer, I think it’s plausible. I also think there are some other factors that could cushion or mitigate a slowdown. First, I think there is more low-hanging fruit available. Now that we’ve seen what large transformer models can do in the text domain, and in text-to-image models like DALL-E, I think the obvious next step is to ingest large quantities of video data. We often talk about the sample inefficiency of modern methods compared with humans, but I think humans are exposed to a TON of sensory data in building their world model. Though if hardware really stalls, maybe there won’t be enough compute or budget to train a 1T+ parameter multimodal model.
The second mitigating factor may be that funding has already been unlocked, to some extent. There is now a lot more money going around for basic research, possibly toward the next big thing. The only thing that might stop it is academic momentum in the wrong directions. Though from an x-risk standpoint, maybe that’s not a bad thing, heh.
In my mental model, if the large transformer models are already good enough to do what we’ve shown them to be able to do, it seems possible that the remaining innovations would be more on the side of engineering the right submodules and cost functions. Maybe something along the lines of Yann LeCun’s recent keynotes.
I think this is absolutely correct. GPT-3/PaLM is scary impressive, but ultimately relies on predicting missing words, and its actual memory during inference is just the words in its context! What scares me about this is that I think there is some really simple low-hanging fruit for modifying something like this to be, at least, slightly more like an agent. Then plug things like this as components into existing agent frameworks, and finally, have entire research programs think about it and experiment on it. It seems like the problem would crack. You never know, but it doesn’t look like we’re out of ideas any time soon.
This is a question for the community, is there any information hazard in speculating on specific technologies here? It would be totally fun, though seems like it could be dangerous...
My initial hope was that the market wasn’t necessarily focused on this direction. Big tech is generally focused on predicting user behavior, which LLMs look set to dominate. But then there are autonomous cars and humanoid robots; no idea what will come of those. The car angle might be slightly safer: because of the need for transparency and explainability, a lot of the logic outside of perception might be hard-coded. Humanoid robots may take a long time to catch on, since most people are probably skeptical of them. Maybe factory automation...
Yeah, I tend to agree. Just wanted to make sure I’m not violating norms. In that case, my specific thoughts are as follows, with a thought to implementing AI transparency at the end.
There is the observation that the transformer architecture doesn’t have a hidden state like an LSTM. I thought for a while that something like this was needed for intelligence: a compact representation of the state one is in. (My biased view, which I’ve since updated away from, was that the weights represented HOW to think, and less about knowledge.) However, it’s really intractable to backpropagate over so many time steps, and transformers have shown us that you really don’t need to. The long-term memory is just in the weights.
So, one obvious thing is that you could simply keep updating the language model on its dialogues, including its own responses and those of its conversation partner(s). Over time it could, in theory, learn from its own conversations and incorporate them into its long-term memory, perhaps learning that dialogue involving ‘itself’ is special. (It could even use special tokens or something if need be.)
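A minimal toy sketch of what I mean (my own illustration, nothing implemented anywhere): a “model” whose long-term memory is just bigram counts, folded in after every exchange, with special tokens marking which turns are ‘itself’. A real system would take gradient steps on a transformer instead.

```python
from collections import defaultdict

class OnlineBigramLM:
    """Toy stand-in for continually updating an LM on its own dialogues.
    The 'weights' here are just bigram counts, updated after each exchange."""

    SELF = "<self>"    # special token marking the model's own turns
    OTHER = "<other>"  # special token marking the partner's turns

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, tokens):
        # Fold the dialogue into long-term memory (the 'weights').
        for a, b in zip(tokens, tokens[1:]):
            self.counts[a][b] += 1

    def ingest_dialogue(self, own_reply, partner_reply):
        # Tag each side with a special token, as suggested above.
        tokens = ([self.SELF] + own_reply.split()
                  + [self.OTHER] + partner_reply.split())
        self.update(tokens)

    def most_likely_next(self, token):
        follows = self.counts[token]
        return max(follows, key=follows.get) if follows else None

lm = OnlineBigramLM()
lm.ingest_dialogue("hello there", "hi model")
lm.ingest_dialogue("hello again", "hi model")
print(lm.most_likely_next(OnlineBigramLM.SELF))  # 'hello' (seen twice after <self>)
```

The point is only the shape of the loop: the model’s own outputs become training data, tagged so it can tell them apart.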
As for variable-length computation, the networks themselves do that via the number of layers they have, but that’s really a fixed feature of the architecture. Instead, variable-length computation would really be a data issue: having enough data that “shows its work”. The model can learn to output intermediate results and use attention over those to continue its thought process. There is a small potential for baking transparency into the model this way: you could force the model, through limited context windows, to carry out its thoughts using human language, basically. Then if the model is doing a very careful analysis of whether it’s ready to take over the world, it would need to go through those calculations, and you could read them.
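As a toy illustration of the limited-context idea (hypothetical, not any real system): a “model” that can only see the last couple of scratchpad lines, so every intermediate result has to be written out in the open where an operator could read it.

```python
def scratchpad_sum(numbers, context_window=2):
    """Toy illustration of variable-length computation via shown work.
    The 'model' may only look at the last `context_window` scratchpad
    lines, so every intermediate result must be externalized as a
    human-readable step rather than held in hidden state."""
    scratchpad = ["partial = 0"]
    for n in numbers:
        visible = scratchpad[-context_window:]         # limited context
        partial = int(visible[-1].split("=")[1])       # re-read own work
        scratchpad.append(f"partial = {partial + n}")  # show the step
    return scratchpad

for line in scratchpad_sum([3, 5, 7]):
    print(line)  # every intermediate thought is legible
```

The computation can run for as many steps as the input demands, but only by leaving a visible trail, which is exactly the transparency hook described above.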
Of course, maybe it’ll learn to use code words or something, but it’s a thought. Maybe you would always be able to ask it what it was thinking about there, and maybe the explanations it gives might not always be convincing.
Though, imagine that the machine starts outputting gibberish intermediate steps, and tells the human operators that thinking using human zipfian distributed words is highly inefficient, and the operators shrug their shoulders and say “Oh cool. Makes sense. Carry on.” As I hear said around here, that’s a way to die with less dignity.
AI x-risk. It sounds crazy for two reasons. One, because we are used to nothing coming close to human intelligence, and two, because we are used to AI being unintelligent. For the first, the only point of comparison is imagining something that is to us what we are to cats. For the second, though we have not quite succeeded yet, it only takes one. If you have been following the news, we are getting close.
It’s easy to imagine that the AI will have an off switch, and that we could keep it locked in a box and ask it questions. But just think about it. If some animals were to put you in a box, do you think you would stay in there forever? Or do you think you’d figure a way out that they hadn’t thought of?
AI existential risk is like climate change. It’s easy to come up with short slogans that make it seem ridiculous. Yet, when you dig deeper into each counterargument, you find none of them are very convincing, and the dangers are quite substantial. There’s quite a lot of historical evidence for the risk, especially in the impact humans have had on the rest of the world. I strongly encourage further, open-minded study.
For ML researchers.
Policy makers.
Policy makers.
I agree; I too am not completely sure of the dynamics of the intelligence explosion. I would like more concrete footing to figure out what takeoff will look like, as neither fast nor slow takeoff has been proven.
My intuition, however, is the opposite. I can’t disprove a slow takeoff, but it seems intuitive to me that there are some “easy” modifications that should take us far beyond human level. Those intuitions, though they could be wrong, are as follows:
- I feel like human capability is limited in some obvious ways. If I had more time and energy to focus on interesting problems, I could accomplish WAY more. Most likely most of us get bored, lazy, distracted, or obligated by our responsibilities too much to unlock our full potential. Also, sometimes our thinking gets cloudy. Reminds me a bit of the movie Limitless. Imagine just being a human, but where all the parts of your brain were a well-oiled machine.
- A single AI would not need to solve so many coordination problems which bog down humanity as a whole from acting like a superintelligence.
- AI can scale its search abilities in an embarrassingly parallel way. It can also optimize different functions for different things, like imagine a brain built for scientific research.
Perhaps intelligence is hard and won’t scale much farther than this, but I feel like if you have this, you already have supervillain level intelligence. Maybe not “make us look like ants” intelligence, but enough for domination.
One other thing I’m interested in, is there a good mathematical model of ‘search’? There may not be an obvious answer. I just feel like there is some pattern that could be leveraged. I was playing hide and seek with my kids the other day, and noticed that, in a finite space, you expect there to be finite hiding spots. True, but every time you think you’ve found them all, you end up finding one more. I wonder if figuring out optimizations or discoveries follow a similar pattern. There are some easy ones, then progressively harder ones, but there are far more to be found than one would expect… so to model finding these over time, in a very large room...
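Here is one toy way to formalize that hide-and-seek intuition (my own made-up model; the parameters `p0` and `decay` are arbitrary): suppose spot i is found on any given sweep with probability p0 · decay^i, so there are a few easy spots and a long tail of hard ones. The expected number found after t sweeps then keeps creeping up long after it looks like you are done.

```python
def expected_found(num_spots, t, p0=0.5, decay=0.7):
    """Toy model of search-as-discovery: spot i is found on any single
    sweep with probability p0 * decay**i (easy spots first, then a long
    tail of harder ones). Returns the expected number of spots found
    after t independent sweeps."""
    return sum(1 - (1 - p0 * decay**i) ** t for i in range(num_spots))

# Easy spots dominate early; the long tail keeps yielding "one more".
for t in (1, 5, 25, 125):
    print(t, round(expected_found(20, t), 2))
```

Under this model the count never quite saturates, which matches the feeling that every time you think you’ve found them all, one more turns up.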
I think the desire works because most honest people know that if they give a good-sounding answer that is ultimately meaningless, no benefit will come of it. People may eventually stop asking questions, knowing the answers are always useless. It’s a matter of estimating future rewards from building relationships.
Now, when a human gives advice to another human, most of the time it is also useless, but not always. Also, it tends to not be straight up lies. Even in the useless case, people still think there is some utility in there, for example, having the person think of something novel, giving them a chance to vent without appearing to talk to a brick wall, etc.
To teach a GPT to do this, maybe there would have to be some reward signal. How to do it with pure language modeling, I’m not sure. Maybe you could continue to train it on examples of its own responses, together with the interviewer’s subsequent response and whether its advice was true or not. With enough of these sessions, perhaps you could run the language model, have it try to predict the human response, and see what it thinks of its own answers, haha.
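A toy sketch of that feedback loop, with simple tallies standing in for actual fine-tuning (the answer categories here are made up for illustration):

```python
from collections import defaultdict

class AdviceFeedback:
    """Toy sketch of the speculated reward signal: log each answer
    together with whether the human later reported it as useful, then
    estimate how trustworthy that kind of answer is. A real system
    would fine-tune the LM on these transcripts instead of tallying."""

    def __init__(self):
        self.useful = defaultdict(int)
        self.total = defaultdict(int)

    def record(self, answer_kind, was_useful):
        self.total[answer_kind] += 1
        self.useful[answer_kind] += int(was_useful)

    def estimated_usefulness(self, answer_kind):
        if self.total[answer_kind] == 0:
            return 0.5  # no evidence yet; assume a coin flip
        return self.useful[answer_kind] / self.total[answer_kind]

fb = AdviceFeedback()
fb.record("good-sounding but vague", False)
fb.record("good-sounding but vague", False)
fb.record("concrete and checkable", True)
print(fb.estimated_usefulness("good-sounding but vague"))  # 0.0
```

The hope is just that, with enough such sessions, the signal separates advice that sounds good from advice that turned out to be good.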
I think we are getting some information. For example, we can see that token level attention is actually quite powerful for understanding language and also images. We have some understanding of scaling laws. I think the next step is a deeper understanding of how world modeling fits in with action generation—how much can you get with just world modeling, versus world modeling plus reward/action combined?
If the transformer architecture is enough to get us there, it gives us a sort of null hypothesis for intelligence: that the structure of predicting sequences by comparing all pairs of elements of a limited sequence is general.
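That “comparing all pairs” structure is small enough to write down directly. A minimal single-head self-attention sketch, with the usual learned projections (W_q, W_k, W_v) omitted so the all-pairs comparison stays visible:

```python
import numpy as np

def self_attention(x):
    """Minimal single-head self-attention: every position is compared
    with every other position via dot products, then each output is a
    weighted average over all positions. Learned projections are
    omitted to keep the all-pairs structure visible."""
    scores = x @ x.T / np.sqrt(x.shape[-1])         # all pairwise comparisons
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ x, weights

x = np.random.default_rng(0).normal(size=(5, 8))    # 5 tokens, dim 8
out, w = self_attention(x)
print(out.shape, w.shape)  # (5, 8) (5, 5)
```

Everything else in a transformer (projections, multiple heads, stacking, MLPs) elaborates on this core, which is part of why the “null hypothesis” framing feels apt to me.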
Not rhetorically: what kinds of questions do you think would better lead to understanding how AGI works?
I think teaching a transformer an internal thought process (predicting the next tokens over a part of the sequence that’s “showing your work”) would be an interesting insight into how intelligence might work. I thought of this a little while back, but discovered it is also a long-standing MIRI research direction into transparency. I wouldn’t be surprised if Google took it up at this point.
I can see the argument of capabilities vs safety both ways. On the one hand, by working on capabilities, we may get some insights. We could figure out how much data is a factor, and what kinds of data they need to be. We could figure out how long term planning emerges, and try our hand at inserting transparency into the model. We can figure out whether the system will need separate modules for world modeling vs reward modeling. On the other hand, if intelligence turns out to be not that hard, and all we need to do is train a giant decision transformer… then we have major problems.
I think it would be great to focus capabilities research into a narrower space as Razied says. My hunch is that a giant language model by itself would not go foom, because it’s not really optimizing for anything other than predicting the next token. It’s not even really aware of the passage of time. I can’t imagine it having a drive to, for example, make the world output only a single word forever. I think the danger would be in trying to make it into an agent.
I also think that there must be alignment work that can be done without knowing the exact nature of the final product. For example, learning the human value function, whether it comes from a brain-like formulation, or inverse RL. I am also curious if there has been work done on trying to find a “least bad” nondegenerate value function, i.e. one that doesn’t kill us, torture us, or tile the universe with junk, even if it does not necessarily want what we want perfectly. I think relevant safety work can always take the form of, “suppose current technology scaled up (e.g. decision transformer) could go foom, what should we do right now that could constrain it?” There is some risk that future advancements could be very different, and work done in this stage is not directly applicable, but I imagine it would still be useful somehow. Also, my intuition is that we could always wonder what’s the next step in capabilities, until the final step, and we may not know it’s the final step.
One thing you have to admit, though. Capabilities research is just plain exciting, probably on the same level as working on the Manhattan project was exciting. I mean, who doesn’t want to know how intelligence works?
Has there been effort put into finding a “least acceptable” value function, one that we hope would not annihilate the universe or turn it degenerate, even if the outcome itself is not ideal? My example would be to try to teach a superintelligence to value all other agents facing surmountable challenges in a variety of environments. The degeneracy condition of this is that, if it does not value the real world, it will simply simulate all agents in a zoo. However, if the simulations are of faithful fidelity, maybe that’s not literally the worst thing. Plus, the zoo, to truly be a good test of the agents, would approach being invisible.
I like to think of it not as trying to show that agent B is not a threat to C. The way it’s set up, we can probably assume B has no chance against C. C also may need to worry about agent D, who is concerned about hypothetical agent E, etc. I think that at some level, the decision an agent X makes is the decision all remaining agents in the hierarchy will make.
That said, I sort of agree that’s the real fear about this method. It’s kind of like using superrationality or something else to solve the prisoner’s dilemma: are you willing to bet your life that the other player still wouldn’t choose Defect, despite what the new theory says? Still, I feel like there’s something there; whether this would work would need some kind of clarification from decision theory.