Three AI Safety Related Ideas

(I have a health problem that is acting up and making it hard to type for long periods of time, so I'm condensing three posts into one.)

1. AI design as opportunity and obligation to address human safety problems

Many AI safety problems are likely to have counterparts in humans. AI designers and safety researchers shouldn't start by assuming that humans are safe (and then try to inductively prove that increasingly powerful AI systems are safe when developed/trained by and added to a team of humans), or try to solve AI safety problems without considering whether their designs or safety approaches exacerbate human safety problems relative to other designs/safety approaches. At the same time, the development of AI may be a huge opportunity to address human safety problems, for example by transferring power from probably unsafe humans to de novo AIs that are designed from the ground up to be safe, or by assisting humans' built-in safety mechanisms (such as moral and philosophical reflection).

2. A hybrid approach to the human-AI safety problem

Idealized humans can be safer than actual humans. An example of an idealized human is a human whole-brain emulation that is placed in a familiar, safe, and supportive virtual environment (along with other humans for socialization), so that they are not subject to problematic "distributional shifts" nor vulnerable to manipulation from other powerful agents in the physical world. One way to take advantage of this is to design an AI that is ultimately controlled by a group of idealized humans (for example, one that has a terminal goal referring to the reflective equilibrium of the idealized humans), but this seems impractical due to computational constraints. An idea to get around this is to give the AI advice, or a hint, that it can serve that terminal goal by learning from actual humans as an instrumental goal. This learning can include imitation learning, value learning, or other kinds of learning. Then, even if the actual humans become corrupted, the AI has a chance of becoming powerful enough to discard its dependence on actual humans and recompute its instrumental goals directly from its terminal goal. (Thanks to Vladimir Nesov for giving me a hint that led to this idea.)
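To make the shape of this hybrid design a bit more concrete, here is a minimal, purely illustrative Python sketch. All of the names (`Policy`, `idealized_human_equilibrium`, `actual_human_feedback`, `choose_policy`) are hypothetical stand-ins I am introducing for exposition, not anything proposed in the post; the point is only the structure: an expensive terminal goal, a cheap instrumental proxy, and a switch-over once the terminal goal can be evaluated directly.

```python
# Toy sketch (all names hypothetical): an agent whose terminal goal refers to
# idealized humans, with "advice" to treat learning from actual humans as an
# instrumental strategy that can later be recomputed from the terminal goal.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Policy:
    """A stand-in for whatever object the agent is optimizing over."""
    description: str


def idealized_human_equilibrium(policy: Policy) -> float:
    """Terminal goal: score a policy by the reflective equilibrium of idealized
    humans (e.g. emulations in a safe, familiar virtual environment).
    Assumed to be too computationally expensive to query while the AI is weak."""
    raise NotImplementedError("impractical to evaluate directly")


def actual_human_feedback(policy: Policy) -> float:
    """Instrumental proxy: cheap signal learned from actual humans
    (imitation learning, value learning, etc.). Placeholder scoring only."""
    return float(len(policy.description))


def choose_policy(candidates: List[Policy],
                  can_evaluate_terminal_goal: bool) -> Policy:
    """Follow the 'advice': rely on actual-human feedback while weak, but
    recompute instrumental goals from the terminal goal once capable enough,
    so that corrupted actual humans stop being load-bearing."""
    scorer: Callable[[Policy], float]
    if can_evaluate_terminal_goal:
        scorer = idealized_human_equilibrium
    else:
        scorer = actual_human_feedback
    return max(candidates, key=scorer)


if __name__ == "__main__":
    options = [Policy("defer to human overseers"), Policy("act autonomously")]
    print(choose_policy(options, can_evaluate_terminal_goal=False))
```

The design choice the sketch tries to highlight is that the dependence on actual humans lives entirely in the instrumental layer, so it can in principle be dropped later without changing what the agent ultimately values.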

3. Several approaches to AI alignment will differentially accelerate the kinds of intellectual progress that are analogous to solving problems low in the polynomial hierarchy.

This is bad if the "good" kind of intellectual progress (such as philosophical progress) is disproportionately high in the hierarchy or outside PH entirely, or if we just don't know how to formulate such progress as problems low in PH. I think this issue needs to be on the radar of more AI safety researchers.
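For readers less familiar with the term, the polynomial hierarchy (this is standard complexity theory, not a claim specific to this post) is built from alternating existential and universal quantifiers over polynomial-time-verifiable statements:

$$\Sigma_0^p = \Pi_0^p = \mathrm{P}, \qquad \Sigma_{k+1}^p = \mathrm{NP}^{\Sigma_k^p}, \qquad \Pi_{k+1}^p = \mathrm{coNP}^{\Sigma_k^p}, \qquad \mathrm{PH} = \bigcup_{k \ge 0} \Sigma_k^p.$$

One way to read "low in PH": a question answerable by exhibiting a single checkable certificate sits at $\Sigma_1^p = \mathrm{NP}$, and, roughly, each additional round of adversarial "answer vs. counterargument" adds one level, so a protocol with only a few such rounds targets only the bottom of the hierarchy.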

(A reader might ask, "differentially accelerate relative to what?" An "aligned" AI could accelerate progress in a bad direction relative to a world with no AI, but still in a good direction relative to a world with only unaligned AI. I'm referring to the former here.)