Scheming AIs: Will AIs fake alignment during training in order to get power?

20 Nov 2023 1:17 UTC

This is a LessWrong sequence version of my report “Scheming AIs: Will AIs fake alignment during training in order to get power?”, available on arXiv here: https://arxiv.org/pdf/2311.08379.pdf. It’s a long report, and I’m hoping that having shorter sections available as separate posts will make them easier to digest, reference, and comment on.

The first post in the sequence contains a summary of the full report. The summary covers most of the main points and technical terms, and I’m hoping it will provide much of the context necessary to understand individual sections of the report on their own.

New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?”

Joe Carlsmith · 15 Nov 2023 17:16 UTC

79 points

26 comments · 30 min read · LW link

Varieties of fake alignment (Section 1.1 of “Scheming AIs”)

Joe Carlsmith · 21 Nov 2023 15:00 UTC

15 points

0 comments · 12 min read · LW link

A taxonomy of non-schemer models (Section 1.2 of “Scheming AIs”)

Joe Carlsmith · 22 Nov 2023 15:24 UTC

13 points

0 comments · 13 min read · LW link

Why focus on schemers in particular (Sections 1.3 and 1.4 of “Scheming AIs”)

Joe Carlsmith · 24 Nov 2023 19:18 UTC

8 points

0 comments · 22 min read · LW link

On “slack” in training (Section 1.5 of “Scheming AIs”)

Joe Carlsmith · 25 Nov 2023 17:51 UTC

1 point

0 comments · 5 min read · LW link

Situational awareness (Section 2.1 of “Scheming AIs”)

Joe Carlsmith · 26 Nov 2023 23:00 UTC

10 points

5 comments · 8 min read · LW link

Two concepts of an “episode” (Section 2.2.1 of “Scheming AIs”)

Joe Carlsmith · 27 Nov 2023 18:01 UTC

19 points

1 comment · 13 min read · LW link

Two sources of beyond-episode goals (Section 2.2.2 of “Scheming AIs”)

Joe Carlsmith · 28 Nov 2023 13:49 UTC

11 points

1 comment · 15 min read · LW link

“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)

Joe Carlsmith · 29 Nov 2023 16:32 UTC

29 points

1 comment · 11 min read · LW link

Is scheming more likely in models trained to have long-term goals? (Sections 2.2.4.1-2.2.4.2 of “Scheming AIs”)

Joe Carlsmith · 30 Nov 2023 16:43 UTC

8 points

0 comments · 6 min read · LW link

How useful for alignment-relevant work are AIs with short-term goals? (Section 2.2.4.3 of “Scheming AIs”)

Joe Carlsmith · 1 Dec 2023 14:51 UTC

10 points

1 comment · 7 min read · LW link

The goal-guarding hypothesis (Section 2.3.1.1 of “Scheming AIs”)

Joe Carlsmith · 2 Dec 2023 15:20 UTC

8 points

1 comment · 15 min read · LW link

Does scheming lead to adequate future empowerment? (Section 2.3.1.2 of “Scheming AIs”)

Joe Carlsmith · 3 Dec 2023 18:32 UTC

9 points

0 comments · 17 min read · LW link

Non-classic stories about scheming (Section 2.3.2 of “Scheming AIs”)

Joe Carlsmith · 4 Dec 2023 18:44 UTC

9 points

0 comments · 20 min read · LW link

Arguments for/against scheming that focus on the path SGD takes (Section 3 of “Scheming AIs”)

Joe Carlsmith · 5 Dec 2023 18:48 UTC

10 points

0 comments · 23 min read · LW link

The counting argument for scheming (Sections 4.1 and 4.2 of “Scheming AIs”)

Joe Carlsmith · 6 Dec 2023 19:28 UTC

10 points

0 comments · 10 min read · LW link

Simplicity arguments for scheming (Section 4.3 of “Scheming AIs”)

Joe Carlsmith · 7 Dec 2023 15:05 UTC

10 points

1 comment · 19 min read · LW link

Speed arguments against scheming (Section 4.4-4.7 of “Scheming AIs”)

Joe Carlsmith · 8 Dec 2023 21:09 UTC

9 points

0 comments · 15 min read · LW link

Summing up “Scheming AIs” (Section 5)

Joe Carlsmith · 9 Dec 2023 15:48 UTC

2 points

1 comment · 11 min read · LW link

Empirical work that might shed light on scheming (Section 6 of “Scheming AIs”)

Joe Carlsmith · 11 Dec 2023 16:30 UTC

8 points

0 comments · 21 min read · LW link