harfe

Karma: 365

 Yoshua Bengio: How Rogue AIs may Arise

harfe23 May 2023 18:28 UTC

92 points

12 comments18 min readLW link

(yoshuabengio.org)

harfe 28 Jun 2022 14:06 UTC
74 points
28
on: Contest: An Alien Message
Solution:
```
#!/usr/bin/env python3

import struct

a = [1.3, 2.1, -0.5] # initialize data
L = 2095 # total length
i = 0

while len(a)<L:
    if abs(a[i]) > 2.0 or abs(a[i+1]) > 2.0:
        a.append(a[i]/2)
        i += 1
    else:
        a.append(a[i] * a[i+1] + 1.0)
        a.append(a[i] - a[i+1])
        a.append(a[i+1] - a[i])
        i += 2

# save in binary; the context manager closes the file even if writing fails
with open("out_reproduced.bin", "wb") as f:
    f.write(struct.pack('d'*L, *a))  # use IEEE 754 double format
```
Then one can check that the produced file is identical:
```
$ sha1sum *bin
70d32ee20ffa21e39acf04490f562552e11c6ab7  out.bin
70d32ee20ffa21e39acf04490f562552e11c6ab7  out_reproduced.bin
```
edit: How I found the solution: I found some of the other comments helpful, especially from gjm (although I did not read all). In particular, interpreting the data as a sequence of 64-bit floating point numbers saved me a lot of time. Also gjm’s mention of the pattern a, -a, b, c, -c, d was an inspiration. If you look at the first couple of numbers, you can see that they are sometimes half of an earlier number. Playing around further with the numbers I eventually found the patterns a[i] * a[i+1] + 1.0 and a[i] - a[i+1]. It remained to figure out when the a[i]/2 rule applies and when the a[i] * a[i+1] + 1.0 rule applies. Here it was a hint that the numbers do not grow too large in size. After trying out several rules that form bounds on a[i] and a[i+1], I eventually found the right one.
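The observations above can be replayed on a short prefix of the sequence. The following is only a minimal sketch, not part of the original solution: the helper name `generate` and the prefix length of 8 are assumptions chosen for illustration, and the bytes are round-tripped in memory rather than read from out.bin.

```python
import struct


def generate(length):
    """Return at least `length` terms produced by the rule from the solution."""
    a = [1.3, 2.1, -0.5]
    i = 0
    while len(a) < length:
        if abs(a[i]) > 2.0 or abs(a[i + 1]) > 2.0:
            a.append(a[i] / 2)
            i += 1
        else:
            a.append(a[i] * a[i + 1] + 1.0)
            a.append(a[i] - a[i + 1])
            a.append(a[i + 1] - a[i])
            i += 2
    return a


terms = generate(8)

# Interpret raw bytes as 64-bit floats, as one would do for the contest file.
raw = struct.pack(f"{len(terms)}d", *terms)
decoded = list(struct.unpack(f"{len(raw) // 8}d", raw))

print(decoded)
```

In this prefix, the fourth and fifth values are halves of the first two, and the last two values are negatives of each other, which matches the a, -a pattern mentioned above.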
What links here?
- Alien Message Contest: Solution by DaemonicSigil (13 Jul 2022 4:07 UTC; 29 points)
- Challenge: A Much More Alien Message by kman (28 Jun 2022 21:50 UTC; 24 points)

harfe 6 Jul 2022 20:33 UTC
33 points
32
on: Deep neural networks are not opaque.
I suspect you use the word “opaque” in a different way than Eliezer Yudkowsky does here. At least, I fail to see from your summary how this would contradict my interpretation of Eliezer’s statement (and your title and introduction seem to imply that it is a contradiction).

Consider the hypothetical example where GPT-3 states (incorrectly) that Geneva is the capital of Switzerland. Can we look at the weights of GPT-3 and see if it was just playing dumb or if it genuinely thinks that Geneva is the capital of Switzerland? If the weights/“matrices”/“giant wall of floating point numbers” are opaque (in the sense of Eliezer according to my guess), then we would look at it and shrug our shoulders. I fail to see from your summary how the effective theories would help in this example. (Disclaimer: In this specific example or similar examples, I would not be surprised if it was actually possible to figure out if it was playing dumb or what caused GPT-3 to play dumb. Also, I do not expect GPT-3 to actually believe that Geneva is the capital of Switzerland.)

My guess of your meaning of “opaque” would be something like “we have no idea why deep learning works at all” or “we have no mathematical theory for the training of neural nets”, which your summary disproves.

harfe 9 May 2023 1:57 UTC
21 points
14
on: Yoshua Bengio argues for tool-AI and to ban “executive-AI”
Overall this is still encouraging. It seems to take seriously that
- value alignment is hard
- executive-AI should be banned
- banning executive-AI would be hard
- alignment research and AI safety is worthwhile.
I feel like there are enough shared assumptions that collaboration or dialogue with AI notkilleveryoneists could be very useful.

That said, I wish there were more details about his Scientist AI idea:
- How exactly will the Scientist AI be used?
- Should we expect the Scientist AI to have situational awareness?
- Would the Scientist AI be allowed to write large scale software projects that are likely to get executed after a brief reviewing of the code by a human?
- Are there concerns about Mesa-optimization?
Also it is not clear to me whether the safety is supposed to come from:
- the AI cannot really take actions in the world (and even when there is a superhuman AI that wants to do large-scale harms, it will not succeed, because it cannot take actions that achieve these goals)
- the AI has no intrinsic motivation for large-scale harm (while its output bits could in principle create large-scale harm, such a string of bits is unlikely because there is no drive towards such strings of bits).
- a combination of these two.

Infra-Bayesian Logic

harfe and Yegreg

5 Jul 2023 19:16 UTC

15 points

2 comments1 min readLW link

harfe 21 Feb 2023 3:20 UTC
12 points
17
in reply to: Adele Lopez’s comment on: There are no coherence theorems
The title “There are no coherence theorems” seems click-baity to me, when the claim relies on a very particular definition of “coherence theorem”. My thought upon reading the title (before reading the post) was something like “surely, VNM would count as a coherence theorem”. I am also a bit bothered by the confident assertions that there are no coherence theorems in the Conclusion and Bottom-lines, for a similar reason.

harfe 12 Jun 2023 13:09 UTC
10 points
8
on: UK PM: $125M for AI safety
This does not sound very encouraging from the perspective of AI Notkilleveryoneism. When the announcement of the foundation model task force talks about safety, I cannot find hints that they mean existential safety. Rather, it seems about safety for commercial purposes.

A lot of the money might go into building a foundation model. At least they should also announce that they will not share weights and details on how to build it, if they are serious about existential safety.

This might create an AI safety race to the top as a solution to the tragedy of the commons

This seems to be the opposite of that. The announcement talks a lot about establishing the UK as a world leader, e.g. “establish the UK as a world leader in foundation models”.

harfe 24 Sep 2023 1:22 UTC
9 points
4
on: The Dick Kick’em Paradox
The setup violates a fairness condition that has been talked about previously.

From https://arxiv.org/pdf/1710.05060.pdf, section 9:

We grant that it is possible to punish agents for using a specific decision procedure, or to design one decision problem that punishes an agent for rational behavior in a different decision problem. In those cases, no decision theory is safe. CDT performs worse that FDT in the decision problem where agents are punished for using CDT, but that hardly tells us which theory is better for making decisions. [...]

Yet FDT does appear to be superior to CDT and EDT in all dilemmas where the agent’s beliefs are accurate and the outcome depends only on the agent’s behavior in the dilemma at hand. Informally, we call these sorts of problems “fair problems.” By this standard, Newcomb’s problem is fair; Newcomb’s predictor punishes and rewards agents only based on their actions. [...]

There is no perfect decision theory for all possible scenarios, but there may be a general-purpose decision theory that matches or outperforms all rivals in fair dilemmas, if a satisfactory notion of “fairness” can be formalized.

harfe 10 Jul 2023 15:08 UTC
9 points
0
in reply to: Vanessa Kosoy’s comment on: The Learning-Theoretic Agenda: Status 2023
I think the conjecture is also false in the case that utility functions map from $O^{ω}$ to $[0, 1]$ .

Let us consider the case of $A = \{a_{1}, a_{2}\}$ and $O = \{o_{1}, o_{2}\}$ . We use $U_{1} (o) = 1 - 2^{- k}$ , where $k$ is the largest integer such that $o$ starts with $o_{1}^{k}$ (and $U_{1} (o_{1}^{ω}) = 1$ ). As for $U_{2}$ , we use $U_{2} (o) = 1 - 3^{- k}$ , where $k$ is the largest integer such that $o$ starts with $o_{1}^{k}$ (and $U_{2} (o_{1}^{ω}) = 1$ ). Both $U_{1}$ and $U_{2}$ are computable, but they are not locally equivalent. Under reasonable assumptions on the Solomonoff prior, the policy $π$ that always picks action $a_{1}$ is the optimal policy for both $U_{1}$ and $U_{2}$ (see proof below).

Note that since the policy is computable and very simple, $g_{0} (π) = \infty$ is not true, and we have $g_{0} (π) = O (1)$ instead. I suspect that the issues are still present even with an additional $g_{0} (π) = \infty$ condition, but finding a concrete example with an uncomputable policy is challenging.

Proof: Suppose that $U_{1}$ and $U_{2}$ are locally equivalent. Let $V$ be an open neighborhood of the point $x = o_{1}^{ω}$ and $α > 0$ , $β \in \mathbb{R}$ be such that $U_{1} (y) = α U_{2} (y) + β$ for all $y \in V$ .

Since $x \in V$ , we have $1 = U_{1} (x) = α U_{2} (x) + β = α + β$ . Because $V$ is an open neighborhood of $o_{1}^{ω}$ , there is an integer $N$ such that $o_{1}^{n} o_{2}^{ω} \in V$ for all $n \geq N$ . For such $n \geq N$ , we have $1 - 2^{- n} = U_{1} (o_{1}^{n} o_{2}^{ω}) = α U_{2} (o_{1}^{n} o_{2}^{ω}) + β = α (1 - 3^{- n}) + β = α + β - α 3^{- n} = 1 - α 3^{- n} .$ This implies $α = (2 / 3)^{- n}$ . However, this is not possible for all $n \geq N$ . Thus, our assumption that $U_{1}$ and $U_{2}$ are locally equivalent was wrong.
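The contradiction can also be seen numerically. This is a small sketch under the assumption that the sequence $o_{1}^{n} o_{2}^{ω}$ is represented simply by the integer $n$; the function names `u1`, `u2` and `required_alpha` are hypothetical and only serve this illustration.

```python
from fractions import Fraction


def u1(n):
    """U_1 evaluated on the sequence o_1^n o_2^omega."""
    return 1 - Fraction(1, 2) ** n


def u2(n):
    """U_2 evaluated on the same sequence."""
    return 1 - Fraction(1, 3) ** n


def required_alpha(n):
    # From U_1 = alpha * U_2 + beta together with alpha + beta = 1
    # it follows that 1 - U_1 = alpha * (1 - U_2).
    return (1 - u1(n)) / (1 - u2(n))


alphas = [required_alpha(n) for n in range(1, 6)]
print(alphas)
```

Each $n$ demands a different value of $α$, so no single $α$ can work on a whole neighborhood of $o_{1}^{ω}$.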

Assumptions about the Solomonoff prior: For all $n$ , the sequence of actions that produces the sequence of $o_{1}^{n}$ with the highest probability is $a_{1}^{n - 1}$ (recall that we start with observations in this setting). With this assumption, it can be seen that the policy that always picks action $a_{1}$ is among the best policies for both $U_{1}$ and $U_{2}$ .

I think this is actually a natural behaviour for a reasonable Solomonoff prior: It is natural to expect that $o_{1} a_{1} o_{1}$ is more likely than $o_{1} a_{2} o_{1}$ . It is natural to expect that the sequence of actions that leads to $o_{1}$ over $o_{2}$ has low complexity. Always picking $a_{1}$ is low complexity.

It is possible to construct an artificial UTM that ensures that “always take $a_{1}$ ” is the best policy for $U_{1}$ , $U_{2}$ : A UTM can be constructed such that the corresponding Solomonoff prior assigns probability $3 / 4$ to the program/environment “start with $o_{1}$ ; after action $a_{i}$ , output $o_{i}$ ”. The rest of the probability mass gets distributed according to some other more natural UTM.

Then, for $U_{1}$ , in each situation with history $o_{1}^{n}$ the optimal policy has to pick $a_{1}$ (the actions outside of this history have no impact on the utility): With probability $3 / 4$ it will get utility of at least $1 - 2^{- (n + 1)}$ , and with probability $1 / 4$ at least $1 - 2^{- n}$ . Whereas, for the choice of $a_{2}$ , with probability $3 / 4$ it will have utility of $1 - 2^{- n}$ , and with probability $1 / 4$ it can get at most $1$ . We calculate $(1 - 2^{- (n + 1)}) \cdot 3 / 4 + (1 - 2^{- n}) \cdot 1 / 4 = 1 - 5 \cdot 2^{- (n + 3)} > 1 - 3 \cdot 2^{- (n + 2)} = (1 - 2^{- n}) \cdot 3 / 4 + 1 / 4$ , i.e. taking action $a_{1}$ is the better choice.

Similarly, for $U_{2}$ , the optimal policy has to pick $a_{1}$ too in each situation with history $o_{1}^{n}$ . Here, the calculation looks like $(1 - 3^{- (n + 1)}) \cdot 3 / 4 + (1 - 3^{- n}) \cdot 1 / 4 = 1 - 3^{- n} / 2 > 1 - 3^{- n + 1} / 4 = (1 - 3^{- n}) \cdot 3 / 4 + 1 / 4$ .
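The two comparisons can be checked with exact rational arithmetic. This is a hedged sketch: it only encodes the bounds stated above (a lower bound for picking $a_{1}$ , an upper bound for picking $a_{2}$ , with the artificial environment having probability $3 / 4$ ), and the function names and the `base` parameter (2 for $U_{1}$ , 3 for $U_{2}$ ) are assumptions made for illustration.

```python
from fractions import Fraction

P_ARTIFICIAL = Fraction(3, 4)  # prior mass of the artificial environment


def lower_bound_a1(base, n):
    """Lower bound on expected utility when picking a_1 after history o_1^n."""
    in_artificial = 1 - Fraction(1, base) ** (n + 1)
    elsewhere = 1 - Fraction(1, base) ** n
    return P_ARTIFICIAL * in_artificial + (1 - P_ARTIFICIAL) * elsewhere


def upper_bound_a2(base, n):
    """Upper bound on expected utility when picking a_2 after history o_1^n."""
    in_artificial = 1 - Fraction(1, base) ** n
    return P_ARTIFICIAL * in_artificial + (1 - P_ARTIFICIAL) * 1


# base=2 corresponds to U_1, base=3 to U_2
all_hold = all(
    lower_bound_a1(base, n) > upper_bound_a2(base, n)
    for base in (2, 3)
    for n in range(50)
)
print(all_hold)
```

Since even the lower bound for $a_{1}$ exceeds the upper bound for $a_{2}$ , the comparison does not depend on what the remaining $1 / 4$ of the prior does.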

harfe 23 May 2023 18:49 UTC
9 points
2
on: Yoshua Bengio: How Rogue AIs may Arise
I think overall this is a well-written blogpost. His previous blogpost already indicated that he took the arguments seriously, so this is not too much of a surprise. That previous blogpost was discussed and partially criticized on LessWrong. As for the current blogpost, I also find it noteworthy that active LW user David Scott Krueger is in the acknowledgements.

This blogpost might even be a good introduction for AI xrisk for some people.

I hope he engages further with the issues. For example, I feel like inner misalignment is still sort of missing from the arguments.

harfe 21 Oct 2022 14:46 UTC
9 points
1
on: Learning societal values from law as part of an AGI alignment strategy

P(misalignment x-risk | AGI that understands democratic law) < P(misalignment x-risk | AGI)

I don’t think this is particularly compelling. While technically true, the difference between those probabilities is tiny. Any AGI is highly likely to understand democratic laws.

harfe 10 Jan 2023 22:56 UTC
8 points
5
on: AGI and the EMH: markets are not expecting aligned or unaligned AI in the next 30 years

if you believe that financial markets are wrong, then you have the opportunity to (1) borrow cheaply today and use that money to e.g. fund AI safety work

How exactly would I go about doing that? A priori this seems difficult: If there were opportunities to cheaply borrow money for e.g. 10 years, lots of people who have strong time discounting would take that option.

harfe 28 Jun 2022 16:24 UTC
8 points
on: I No Longer Believe Intelligence to be “Magical”

During my 2017 binge of LW, I recall Yudkowsky suggesting that a superintelligence could infer the laws of physics from a single frame of video showing a falling apple (Newton apparently came up with his idea of gravity from observing a falling apple).

I now think that’s somewhere between deeply magical and utter nonsense.

Some details here: You are likely referring to That Alien Message. In my opinion Eliezer Yudkowsky made a weaker claim than you are implying:

A Bayesian superintelligence, hooked up to a webcam, would invent General Relativity as a hypothesis—perhaps not the dominant hypothesis, compared to Newtonian mechanics, but still a hypothesis under direct consideration—by the time it had seen the third frame of a falling apple. It might guess it from the first frame, if it saw the statics of a bent blade of grass.

To me it does not seem hard for a superintelligence (or 1000 years of Einstein-level thinking) to come up with Newtonian mechanics as a hypothesis from three frames of a falling apple. But I am not sure about the (weakly stated) suggestion that you could derive it from a picture of a bent blade of grass.

harfe 6 Feb 2023 20:14 UTC
7 points
3
on: Why Theorems? A Personal Perspective
Nice post! Here are some thoughts:
- We do not necessarily need fully formally proven theorems; other forms of high confidence in safety could be sufficient. For example, I would be comfortable with turning on an AI that is safe iff the Goldbach conjecture is true, even if we have no proof of the Goldbach conjecture.
- We currently don’t have any idea what kind of theorems we want to prove. Formulating the right theorem is likely more difficult than proving it.
- Theorems can rely on imperfect assumptions (that are not exact as in the real world). In such a case, it is not clear that they give us the degree of high confidence that we would like to have.
- Theorems that rely on imperfect assumptions could still be very valuable and increase overall safety, nonetheless. For example, if we could prove something like “this software is corrigible, assuming we are in a world run by Newtonian physics” then this could (depending on the details) be high evidence for the software being corrigible in a Quantum world.

harfe 4 Dec 2022 14:38 UTC
7 points
1
on: AI can exploit safety plans posted on the Internet
I dislike the framing of this post. Reading this post gave me the impression that
- You wrote a post with a big prediction (“AI will know about safety plans posted on the internet”)
- Your post was controversial and did not receive a lot of net-upvotes
- Comments that disagree with you receive a lot of upvotes. Here you make me think that these upvoted comments disagree with the above prediction.
But actually reading the original post and the comments reveals a different picture:
- The “prediction” was not a prominent part of your post.
- The comments such as this imo excellent comment did not disagree with the “prediction”, but other aspects of your post.
Overall, I think it's highly likely that the downvotes were not because people did not believe that future AI systems will know about safety plans posted on LW/EAF, but because of other reasons. I think people were well aware that AI systems will get to know about plans for AI safety, just as I think that it is very likely that this comment itself will be found in the training data of future AI systems.

harfe 20 Dec 2022 19:05 UTC
6 points
1
on: I believe some AI doomers are overconfident
I wonder if someone could create a similarly structured argument for the opposite viewpoint. (Disclaimer: I do not endorse a mirrored argument of OP’s argument.)

You could start with “People who believe there is a >50% possibility of humanity’s survival in the next 50 years or so strike me as overconfident.”, and then point out that for every plan of humanity’s survival, there are a lot of things that could potentially go wrong.

The analogy is not perfect, but to a first approximation, we should expect that things can go wrong in both directions.

harfe 5 Oct 2023 23:06 UTC
5 points
4
on: Provably Safe AI
Thank you for writing this review.

The strategy assumes we’ll develop a good set of safety properties that we’re demanding proof of.

I think this is very important. From skimming the paper it seems that unfortunately the authors do not discuss it much. I imagine that formally specifying safety properties is actually a rather difficult step.

To go with the example of not helping terrorists spread harmful viruses: How would you even go about formulating this mathematically? This seems highly non-trivial to me. Do you need to mathematically formulate what exactly harmful viruses are?

The same holds for Asimov’s three laws of robotics: turning these into actual math or code seems to be quite challenging.

There’s likely some room for automated systems to figure out what safety humans want, and turn it into rigorous specifications.

Probably obvious to many, but I’d like to point out that these automated systems themselves need to be sufficiently aligned to humans, while also accomplishing tasks that are difficult for humans to do and probably involve a lot of moral considerations.

harfe 9 Jun 2023 22:32 UTC
5 points
4
in reply to: jimrandomh’s comment on: Transformative AGI by 2043 is <1% likely
There is an additional problem where one of the two key principles for their estimates is

Avoid extreme confidence

Suppose this principle leads you to pick probability estimates that keep some distance from 1 (e.g. by picking at most 0.95).

If you build a fully conjunctive model, and you are not that great at extreme probabilities, then you will have a strong bias towards low overall estimates. And you can make your probability estimates even lower by introducing more (conjunctive) factors.
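To make the size of this bias concrete, here is a minimal sketch. It assumes that every factor is capped at the same value (0.95, the example cap from above) and that the overall estimate is the plain product of the factors; the function name is hypothetical.

```python
def max_conjunctive_estimate(cap, num_factors):
    """Largest overall estimate possible when each factor is at most `cap`."""
    return cap ** num_factors


for k in (1, 5, 10, 20):
    print(k, round(max_conjunctive_estimate(0.95, k), 3))
```

With ten factors the overall estimate cannot exceed about 0.60, even if every single step is considered almost certain, and each further factor lowers the ceiling again.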

harfe 31 May 2023 19:47 UTC
5 points
1
on: Improving Mathematical Reasoning with Process Supervision
From just reading your excerpt (and not the whole paper), it is hard to determine how much alignment washing is going on here.
- what is aligned chain-of-thought? What would unaligned chain-of-thought look like?
- what exactly does alignment mean in the context of solving math problems?
But maybe these worries can be answered from reading the full paper...

harfe 9 May 2023 0:29 UTC
5 points
2
in reply to: Chris_Leong’s comment on: Yoshua Bengio argues for tool-AI and to ban “executive-AI”
Potentially relevant: Yoshua Bengio got funding from OpenPhil in 2017:

https://www.openphilanthropy.org/grants/montreal-institute-for-learning-algorithms-ai-safety-research/
