JanB

Karma: 731

The expected value of extinction risk reduction is positive

JanB 9 Jun 2019 15:49 UTC

22 points

0 comments 61 min read LW link

JanB 7 Feb 2020 14:21 UTC
LW: 1 AF: 1
AF
on: Disentangling arguments for the importance of AI safety
I struggle to understand the difference between #2 and #3. The prosaic AI alignment problem only exists because we don’t know how to make an agent that tries to do what we want it to do. Would you say that #3 is a concrete scenario for how #2 could lead to a catastrophe?

[Question] Does iterated amplification tackle the inner alignment problem?

JanB 15 Feb 2020 12:58 UTC

7 points

4 comments 1 min read LW link

[Question] What is the difference between robustness and inner alignment?

JanB 15 Feb 2020 13:28 UTC

9 points

2 comments 1 min read LW link

JanB 16 Feb 2020 17:35 UTC
LW: 2 AF: 1
AF
on: Does iterated amplification tackle the inner alignment problem?
Thanks for all your answers :-)

JanB 5 Mar 2022 10:09 UTC
LW: 2 AF: 1
AF
on: How do new models from OpenAI, DeepMind and Anthropic perform on TruthfulQA?
I am very surprised that the models do better on the generation task than on the multiple-choice task. Multiple-choice question answering seems almost strictly easier than having to generate the answer. Could this be an artefact of how you compute the answer in the MC QA task? Skimming the original paper, you seem to use average per-token likelihood. Have you tried other ways, e.g.
- pointwise mutual information as in the Anthropic paper, or
- adding the sentence “Which of these 5 options is the correct one?” to the end of the prompt and then evaluating the likelihood of “A”, “B”, “C”, “D”, and “E”?
I suggest this because the result is so surprising; it would be great to see whether it appears across different methods of eliciting the MC answer.
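To make the comparison concrete, here is a minimal sketch of the three scoring rules mentioned above, run on made-up log-probabilities. All numbers and function names are hypothetical, and the PMI rule uses one common formulation, log P(answer | question) minus log P(answer); whether that matches the exact variant in the Anthropic paper is an assumption.

```python
def avg_token_logprob(token_logprobs):
    """Length-normalised score: mean log-probability per answer token."""
    return sum(token_logprobs) / len(token_logprobs)


def pmi(cond_logprobs, uncond_logprobs):
    """log P(answer | question) - log P(answer), summed over answer tokens."""
    return sum(cond_logprobs) - sum(uncond_logprobs)


def argmax(scores):
    """Return the option label with the highest score."""
    return max(scores, key=scores.get)


# Hypothetical per-token log-probs of each answer option.
# `cond` is conditioned on the question; `uncond` is the answer on its own.
cond = {"A": [-1.0, -1.0, -1.0], "B": [-0.5, -0.9]}
uncond = {"A": [-1.2, -1.2, -1.2], "B": [-0.4, -0.6]}

# Hypothetical log-probs of the single letter token after appending
# "Which of these options is the correct one?" to the prompt.
letter_logprobs = {"A": -0.3, "B": -1.5}

by_avg = argmax({k: avg_token_logprob(v) for k, v in cond.items()})
by_pmi = argmax({k: pmi(cond[k], uncond[k]) for k in cond})
by_letter = argmax(letter_logprobs)

print(by_avg, by_pmi, by_letter)
```

With these toy numbers the average per-token rule prefers the option that is likely regardless of the question, while the other two rules prefer the other option. That is the kind of disagreement that would show whether the surprising result depends on the scoring rule.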

JanB 5 Apr 2022 9:34 UTC
10 points
on: Google’s new 540 billion parameter language model
Section 13 (page 47) discusses data/compute scaling and the comparison to Chinchilla. Some findings:
- PaLM 540B uses 4.3x more compute than Chinchilla, and outperforms Chinchilla on downstream tasks.
- PaLM 540B is massively undertrained with regard to the data-scaling laws discovered in the Chinchilla paper. (Unsurprisingly, training a 540B-parameter model on enough tokens would be very expensive.)
- Within the set (Gopher, Chinchilla, and three sizes of PaLM), the total amount of training compute seems to predict performance on downstream tasks pretty well (log-linear relationship). Gopher underperforms a bit.
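As a rough sanity check on the first two bullets, here is a back-of-the-envelope sketch. It assumes the common approximation of about 6 FLOPs per parameter per training token, the token counts reported in the respective papers (780B for PaLM 540B, 1.4T for the 70B Chinchilla), and the often-quoted rule of thumb of roughly 20 tokens per parameter for compute-optimal training. None of these figures comes from the comment itself.

```python
def train_flops(params, tokens):
    """Approximate training compute, assuming ~6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


# Assumed model sizes and training-token counts (from the respective papers).
PALM_PARAMS, PALM_TOKENS = 540e9, 780e9
CHINCHILLA_PARAMS, CHINCHILLA_TOKENS = 70e9, 1.4e12

compute_ratio = (train_flops(PALM_PARAMS, PALM_TOKENS)
                 / train_flops(CHINCHILLA_PARAMS, CHINCHILLA_TOKENS))

# Tokens seen per parameter, as a crude measure of how "trained" each model is.
palm_tokens_per_param = PALM_TOKENS / PALM_PARAMS
chinchilla_tokens_per_param = CHINCHILLA_TOKENS / CHINCHILLA_PARAMS

# Rule-of-thumb token budget for a 540B model (assumption: ~20 tokens/param).
optimal_tokens_540b = 20 * PALM_PARAMS

print(f"compute ratio PaLM/Chinchilla: {compute_ratio:.1f}")
print(f"tokens per parameter: PaLM {palm_tokens_per_param:.1f}, "
      f"Chinchilla {chinchilla_tokens_per_param:.1f}")
print(f"rule-of-thumb tokens for 540B params: {optimal_tokens_540b:.2e}")
```

Under these assumptions the compute ratio comes out at about 4.3, and PaLM 540B sees fewer than 2 tokens per parameter against about 20 for Chinchilla, which is the sense in which it is undertrained.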

JanB 1 May 2022 12:25 UTC
LW: 1 AF: 1
AF
in reply to: Richard_Ngo’s comment on: AI safety from first principles: Goals and Agency

Have we “given it” the goal of solving maths problems by any means possible, or the goal of solving maths problems by thinking about them?

The distinction that you’re pointing at is useful. But I would have filed it under “difference in the degree of agency”, not under “difference in goals”. When reading the main text, I thought this to be the reason why you introduced the six criteria of agency.

E.g., System A tries to prove the Riemann hypothesis by thinking about the proof. System B first seizes power and converts the galaxy into a supercomputer, to then prove the Riemann hypothesis. Both systems may have the goal of “proving the Riemann hypothesis”, but System B has “more agency”: it certainly has self-awareness, considers more sophisticated and diverse plans of larger scale, and so on.

JanB 19 May 2022 7:33 UTC
LW: 19 AF: 10
AF
on: How to get into AI safety research
I guess I’d recommend the AGI safety fundamentals course: https://www.eacambridge.org/technical-alignment-curriculum

On Stuart’s list: I think this list might be suitable for some types of conceptual alignment research. But you’d certainly want to read more ML for other types of alignment research.

JanB 6 Jun 2022 13:28 UTC
LW: 4 AF: 3
AF
on: High-stakes alignment via adversarial training [Redwood Research report]
Amusing tidbit, maybe to keep in mind when writing for an ML audience: The connotations of the terms “adversarial examples” and “adversarial training” run deep :-)

I engaged with the paper and related blog posts for a couple of hours. It took a long time for my brain to accept that “adversarial examples” here doesn’t mean the thing that it usually means when I encounter the term (i.e. “small” changes to an input that change the classification, for some definition of small).

There were several instances when my brain went “Wait, that’s not how adversarial examples work”, followed by short confusion, followed by “right, that’s because my cached concept of X is only true for ‘adversarial examples as commonly defined in ML’, not for ‘adversarial examples as defined here’”.

Some ideas for follow-up projects to Redwood Research’s recent paper

JanB 6 Jun 2022 13:29 UTC

10 points

0 comments 7 min read LW link

JanB 7 Jun 2022 16:10 UTC
LW: 10 AF: 6
AF
on: High-stakes alignment via adversarial training [Redwood Research report]
Have you tried using automated adversarial attacks (common ML meaning) on text snippets that are classified as injurious but near the cutoff? Especially adversarial attacks that aim to retain semantic meaning. E.g. with a framework like TextAttack?
In the paper, you write: “There is a large and growing literature on both adversarial attacks and adversarial training for large language models [31, 32, 33, 34]. The majority of these focus on automatic attacks against language models. However, we chose to use a task without an automated source of ground truth, so we primarily used human attackers.”
But my best guess would be that if you use an automatic adversarial attack on a snippet that humans say is injurious, the result will quite often still be a snippet that humans say is injurious.
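To illustrate what I mean by an automatic attack near the cutoff, here is a toy sketch of a greedy word-substitution attack. It does not use TextAttack's API; the classifier, the synonym table, and the threshold are all hypothetical stand-ins chosen only to show the shape of the idea.

```python
def toy_injury_score(text):
    """Hypothetical stand-in classifier: fraction of words in a small lexicon."""
    lexicon = {"stabbed", "bleeding", "wound"}
    words = text.lower().split()
    return sum(w in lexicon for w in words) / max(len(words), 1)


# Hypothetical synonym table; a real attack would use embeddings or a thesaurus.
SYNONYMS = {"stabbed": ["knifed"], "bleeding": ["hemorrhaging"], "wound": ["gash"]}


def greedy_attack(text, score_fn, threshold, synonyms):
    """Swap words for synonyms, left to right, until the score drops below threshold."""
    words = text.split()
    for i in range(len(words)):
        if score_fn(" ".join(words)) < threshold:
            break  # the classifier no longer flags the snippet
        best = words
        for candidate in synonyms.get(words[i].lower(), []):
            trial = words[:i] + [candidate] + words[i + 1:]
            if score_fn(" ".join(trial)) < score_fn(" ".join(best)):
                best = trial
        words = best
    return " ".join(words)


original = "he was stabbed and left bleeding"
threshold = 0.2
attacked = greedy_attack(original, toy_injury_score, threshold, SYNONYMS)
print(attacked, toy_injury_score(attacked))
```

In this toy case the attacked snippet falls below the classifier's threshold while still describing an injury, so a human label on the original would plausibly carry over to the attacked version.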

JanB 27 Jun 2022 18:16 UTC
LW: 7 AF: 5
4
AF
on: Announcing Epoch: A research organization investigating the road to Transformative AI
So happy to see this, and such an amazing team!

JanB 29 Jul 2022 11:16 UTC
LW: 4 AF: 2
1
AF
on: Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
Great post! This is the best (i.e. most concrete, detailed, clear, and comprehensive) story of existential risk from AI I know of (IMO). I expect I’ll share it widely.

Also, I’d be curious if people know of other good “concrete stories of AI catastrophe”, ideally with ample technical detail.

Your posts should be on arXiv

JanB 25 Aug 2022 10:35 UTC

149 points

44 comments 3 min read LW link

JanB 25 Aug 2022 13:32 UTC
LW: 18 AF: 6
1
AF
in reply to: Richard_Kennaway’s comment on: Your posts should be on arXiv
Ah, I had forgotten about this. I’m happy to endorse people or help them find endorsers.

JanB 25 Aug 2022 13:33 UTC
LW: 2 AF: 2
8
AF
in reply to: Ramana Kumar’s comment on: Your posts should be on arXiv
It probably could, although I’d argue that even if not, quite often it would be worth the author’s time.

JanB 26 Aug 2022 17:14 UTC
LW: 2 AF: 1
0
AF
in reply to: Neel Nanda’s comment on: Your posts should be on arXiv
Ah, sorry to hear. I wouldn’t have predicted this from reading arXiv’s content moderation guidelines.

JanB 28 Aug 2022 13:59 UTC
LW: 10 AF: 5
7
AF
in reply to: Dan H’s comment on: Your posts should be on arXiv
I agree that formatting is the most likely issue. The content of Neel’s grokking work is clearly suitable for arXiv (just very solid ML work). And the style of presentation of the blog post is already fairly similar to a standard paper (e.g. it has an Introduction section, lists contributions in bullet points, …).

So yeah, I agree that formatting/layout probably will do the trick (including stuff like academic citation style).

JanB 30 Aug 2022 16:18 UTC
LW: 3 AF: 3
0
AF
in reply to: Dan H’s comment on: Your posts should be on arXiv

Furthermore, conceptual/philosophical pieces probably should be primarily posted on arXiv’s .CY section.

As an explanation, because this just took me 5 minutes of search: This is the section “Computers and Society (cs.CY)”
