Scheming AIs: Will AIs fake alignment during training in order to get power?

This is a LessWrong sequence version of my report “Scheming AIs: Will AIs fake alignment during training in order to get power?”, available on arXiv here: https://arxiv.org/pdf/2311.08379.pdf. It’s a long report, and I’m hoping that having shorter sections available as separate posts will make them easier to digest, reference, and comment on.

The first post in the sequence contains a summary of the full report. The summary covers most of the main points and technical terms, and I’m hoping it will provide much of the context necessary to understand individual sections of the report on their own.

New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?”

Varieties of fake alignment (Section 1.1 of “Scheming AIs”)

A taxonomy of non-schemer models (Section 1.2 of “Scheming AIs”)

Why focus on schemers in particular (Sections 1.3 and 1.4 of “Scheming AIs”)

On “slack” in training (Section 1.5 of “Scheming AIs”)

Situational awareness (Section 2.1 of “Scheming AIs”)

Two concepts of an “episode” (Section 2.2.1 of “Scheming AIs”)

Two sources of beyond-episode goals (Section 2.2.2 of “Scheming AIs”)

“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)

Is scheming more likely in models trained to have long-term goals? (Sections 2.2.4.1-2.2.4.2 of “Scheming AIs”)

How useful for alignment-relevant work are AIs with short-term goals? (Section 2.2.4.3 of “Scheming AIs”)

The goal-guarding hypothesis (Section 2.3.1.1 of “Scheming AIs”)

Does scheming lead to adequate future empowerment? (Section 2.3.1.2 of “Scheming AIs”)

Non-classic stories about scheming (Section 2.3.2 of “Scheming AIs”)

Arguments for/against scheming that focus on the path SGD takes (Section 3 of “Scheming AIs”)

The counting argument for scheming (Sections 4.1 and 4.2 of “Scheming AIs”)

Simplicity arguments for scheming (Section 4.3 of “Scheming AIs”)

Speed arguments against scheming (Section 4.4-4.7 of “Scheming AIs”)

Summing up “Scheming AIs” (Section 5)

Empirical work that might shed light on scheming (Section 6 of “Scheming AIs”)