My current research interests:
- alignment in complex, messy systems composed of both humans and AIs
- actually good mathematized theories of cooperation and coordination
- active inference
- bounded rationality
Research at the Alignment of Complex Systems Research Group (acsresearch.org), Centre for Theoretical Studies, Charles University in Prague. Formerly a research fellow at the Future of Humanity Institute, Oxford University.
Previously I was a researcher in physics, studying phase transitions, network science and complex systems.
I’m not a good writer, so this is a brainstormed unpolished bullet-point list, written in about 40 minutes.
Principles which counteract convergent instrumental goals
1. Disutility from resource acquisition—e.g. penalized via some mutual-information measure between the AI and distant parts of the environment
2. Task uncertainty with a reasonable prior on goal drift—the system is unsure about the task it is trying to do and seeks human input about it.
3. An AI which ultimately wants not to exist in the future as a terminal goal; fulfilling the task is on the simplest trajectory to non-existence.
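A toy sketch of how principle 1 could be operationalized. Everything here is an illustrative assumption, not a worked-out proposal: I assume discrete states, a plug-in empirical estimate of mutual information from sampled rollouts, and a hypothetical penalty weight `lam`.

```python
import numpy as np

def empirical_mi(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    joint = np.zeros((xs.max() + 1, ys.max() + 1))
    for x, y in zip(xs, ys):
        joint[x, y] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginal of X
    py = joint.sum(axis=0, keepdims=True)  # marginal of Y
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

def penalized_return(task_reward, agent_states, distant_env_states, lam=1.0):
    """Task reward minus a disutility term for statistical dependence
    between the agent and distant parts of the environment—a crude
    proxy for resource acquisition."""
    return task_reward - lam * empirical_mi(agent_states, distant_env_states)
```

If the agent's states are independent of the distant environment, the penalty is near zero; the more the agent's state predicts (i.e., has entangled itself with) distant resources, the larger the disutility.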
Principles which counteract unbounded rationality
4. Information-theoretic penalty on reasoning: any bit-flip in the system is assumed to have some utility cost. The cost function could be very small initially but grow steep quickly.
5. Satisficing (mentioned)
6. Myopia (mentioned)
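A toy sketch combining principles 4–6. All the specifics are illustrative assumptions: I measure "reasoning cost" as search steps rather than literal bit-flips, and the function names, thresholds, and the superlinear cost shape are made up for the example.

```python
def compute_cost(steps, eps=1e-6, power=2.0):
    """Principle 4: a penalty that is tiny for small amounts of
    reasoning but grows superlinearly (gets steep fast)."""
    return eps * steps ** power

def satisficing_search(candidates, utility, threshold, budget):
    """Principles 5-6: accept the first plan that is good enough
    (satisficing) within a bounded horizon (a myopia-like budget),
    charging a compute penalty for each candidate examined."""
    best = None
    for steps, plan in enumerate(candidates[:budget], start=1):
        score = utility(plan) - compute_cost(steps)
        if score >= threshold:
            return plan, score  # good enough: stop searching
        if best is None or score > best[1]:
            best = (plan, score)
    return best  # budget exhausted: return the best seen so far
```

Note the agent stops at the first acceptable plan rather than continuing to the maximizing one—exactly the behaviour an unbounded optimizer would not exhibit.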
Traps
7. Tripwire artifacts. Tampering with parts of the environment that are unknown to the system and unrelated to the task is associated with large negative rewards.
8. External watchdogs. Smaller, faster external systems able to react quickly to out-of-distribution behaviour.
9. Ontological uncertainty about level of simulation.
Oversight
10. Human-approval model based on imitation learning, sped up/amplified
11. Human-values ethics model, based on value learning
12. Legal-system-amplified model of negative limits of violating property rights or similar
13. Red-teaming of action plans, AI-debate style, feeding into the previous models
Interpretability
14. Imposing strong incentives for internal modularity, and interpretable messaging across module boundaries
15. Human-level explanations, produced by an independent “translator” system
I think that with the help of GPT-3 and the use of existing texts detailing individual topics, a capable writer could expand this list to ~10x more words, written in a compelling style, in something between a few hours and a few days. I don’t think it makes any sense for me to do that. (I’d happily agree with claims of the type “Eliezer is much better than any other person in the specific direction of writing glowfic about AI alignment topics”, but my understanding of the claim is more in the direction of “all principles except 2 in this list were invented by Eliezer, and no one else invented/can invent any other ones”.)