I did some basic analysis to start this one off. I’m not a data scientist, but I’m curious how people’s optimisations compare to my baseline.
I wrote a quick Python function to filter for our specific combination of character traits, then built a dictionary of how often each skill combination won, treating (Skill 1, Skill 2) and (Skill 2, Skill 1) as identical for our purposes. The top three were [‘Enlightenment, Radiant Splendor’, 0.943], [‘Anomalous Agility, Temporal Distortion’, 0.918], [‘Monstrous Regeneration, Temporal Distortion’, 0.9]. Our winner is Enlightenment/Radiant Splendor with a total win rate of 230/244, or 94.3%, among non-sociopath, non-otaku, nerdy, office-working, non-hikikomori heroes. Looks good!
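For anyone who wants to reproduce this, here's a minimal sketch of the kind of counting I mean. The record format and field names (`skill_1`, `skill_2`, `won`) are hypothetical placeholders, since I haven't shared the actual dataset schema; adapt them to whatever the real data looks like.

```python
from collections import defaultdict

def pair_win_rates(records):
    """Win rate per unordered skill pair.

    Each record is assumed to be a dict like
    {"skill_1": ..., "skill_2": ..., "won": bool}
    (hypothetical field names -- adjust to the real dataset).
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        # frozenset makes (Skill 1, Skill 2) and (Skill 2, Skill 1) identical
        pair = frozenset((rec["skill_1"], rec["skill_2"]))
        totals[pair] += 1
        wins[pair] += rec["won"]
    # Return (pair-name, win rate) sorted best-first
    return sorted(
        ((", ".join(sorted(pair)), wins[pair] / totals[pair]) for pair in totals),
        key=lambda item: item[1],
        reverse=True,
    )
```

Filtering to a given personality archetype (or to Chaos Deity assignments, as below) is then just a matter of subsetting `records` before calling this.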
But then I thought: what if the 94.3% reflects “the kind of people who would pick these choices” rather than the skills themselves? So I looked at the results for our personality archetype where the skills were assigned by the Chaos Deity rather than chosen. Enlightenment/Radiant fell to 50%, and the top three became [‘Anomalous Agility, Temporal Distortion’, 0.95], [‘Barrier Conjuration, Mind Palace’, 0.923], [‘Monstrous Regeneration, Rapid XP Gain’, 0.917]. The problem is that our sample size is now vastly reduced: 0.95 is actually just 19/20.
The clear winner from this analysis so far appears to be Agility/Temporal, but I haven’t done any probability analysis on it, I don’t have the maths to confidently do so, and the sample size is low. When the pair is assigned at random, it wins 95% of the time; when someone specifically selects it, it still wins 91.8% of the time. That’s high either way, and since we intend to pick purely on the data we aren’t worrying much about what kind of person we are, but I’m still curious whether this matters. Does it matter that the kind of person who selects Agility/Temporal from the list loses more often than chance would suggest, or have we sidestepped that with our data-science approach? We have conditioned on our own personality as best we can with the data available, after all.
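One cheap way to put actual error bars on these proportions without heavy maths is a Wilson score interval. This is a standard textbook formula, not something from my original analysis, but it makes the sample-size problem concrete:

```python
import math

def wilson_interval(wins, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = wins / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half
```

Running this on our two candidates: 19/20 gives roughly (0.76, 0.99), while 230/244 gives roughly (0.91, 0.97). So the randomly-assigned A/T estimate is consistent with anything from “worse than E/R” to “near-certain victory”, which is exactly the low-sample worry.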
So, it seems we have a strange setup here—do we pick the low sample size items that seemed to give us the most victories, or do we pick the thing that people like us were most likely to win with?
In any case, I gave myself only an hour or two on this problem, and here’s what I’ve come up with so far: Agility/Temporal should give us a 95% chance of victory with wide error bars; Enlightenment/Radiant gives 94.3% if we trust that we are sufficiently similar to the subset of our personality archetype that would have picked E/R without the data-science approach.
I think Agility/Temporal is better. We should be taking both possibilities into account. If the strategy of “select the skills that won the most among our personality archetype” is correct, picking A/T reduces our win rate from 94.3% to 91.8%. If the strategy of “select the skills most likely to win when randomly assigned to you” is right, picking A/T raises our win rate from 50% to 95%. These are not equal payoffs. In the absence of more evidence, I’m selecting A/T, since I’m confident our win rate with it should be above 90%.
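To make the payoff asymmetry explicit, here’s the arithmetic as a tiny sketch. The 50/50 weighting over the two hypotheses is my assumption, not something derived from the data:

```python
# Win rate of each pick under the two candidate models:
#   (rate if "pick what our archetype won with" is the right model,
#    rate if "pick what wins when randomly assigned" is the right model)
rates = {
    "E/R": (0.943, 0.50),
    "A/T": (0.918, 0.95),
}

def expected_win_rate(pick, p_archetype_model=0.5):
    """Expected win rate given a prior on which model is correct.

    The default 50/50 prior is an assumption for illustration.
    """
    archetype_rate, random_rate = rates[pick]
    return p_archetype_model * archetype_rate + (1 - p_archetype_model) * random_rate
```

Under a 50/50 prior, E/R comes out around 0.72 and A/T around 0.93; notably, A/T’s worst case (91.8%) already beats E/R’s expected value, which is why I’m comfortable picking it.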
Looking forward to see how people improve on this!
“I’ve seen people argue that AGI will never exist, and even if we can get an AI to do everything a human can do, that won’t be “true” general intelligence. I’ve seen people say that Gato is a general intelligence, and we are living in a post-AGI world as I type this. Both of these people may make the exact same practical predictions on what the next few years will look like, but will give totally different answers when asked about AGI timelines!”
This is an amazingly good point. It’s also made me realise that I don’t have a solid definition of what “AGI” means to me either. More importantly, coming up with a definition would not solve the general case: even if I had a precise definition of what I meant, I’d have to rewrite it every time I wanted to speak about AGI.
Excellent post, and I would definitely like to see people more knowledgeable than I am make predictions based on these definitions, such as “I wouldn’t worry about an AI that passed <Definition X> but would be very worried about one that passed <Definition Y>” or “I think we’re 50% likely to get <Definition Z> by <Year>”.