This is what I think he means:
The object-level facts are not written by or comprehensible to humans, no. What’s comprehensible is the algorithm the AI agent uses to form beliefs and make decisions based on those beliefs. Yudkowsky often compares gradient descent optimizing a model to evolution optimizing brains, so he seems to think that understanding the outer optimization algorithm is separate from understanding the inner algorithms of the neural network’s “mind”.
I think what he imagines as a non-inscrutable AI design is something vaguely like: “This module takes in sense data and uses it to generate beliefs about the world, which are represented as X and updated with algorithm Y; algorithm Z generates actions, which are graded with a utility function represented as W; and we can prove theorems and run experiments with all of these pieces in order to make confident claims about what the whole system will do.” (The true design would be far more complicated, but still comprehensible.)
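For concreteness, here is a minimal toy sketch of the shape of such a design. Everything in it is made up for illustration (the hypotheses, the likelihood table, the utility values); it is not any real agent-foundations proposal, just a picture of what “separable, inspectable X/Y/Z/W” might mean:

```python
# Purely illustrative: "X", "Y", "Z", and "W" below are toy stand-ins, not any
# real proposal. The point is that every piece is an explicit, separable object
# you could inspect, test, or prove theorems about.

# "Represented as X": beliefs are an explicit probability table over hypotheses.
Beliefs = dict[str, float]

# A toy likelihood model P(observation | hypothesis), also fully explicit.
LIKELIHOOD = {
    ("wet", "raining"): 0.9, ("wet", "sunny"): 0.1,
    ("dry", "raining"): 0.1, ("dry", "sunny"): 0.9,
}

def update(beliefs: Beliefs, observation: str) -> Beliefs:
    """Algorithm Y: ordinary Bayesian updating on the incoming sense data."""
    unnormalized = {h: p * LIKELIHOOD[(observation, h)] for h, p in beliefs.items()}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

# Utility function W: an explicit, human-readable grading of (action, hypothesis) pairs.
UTILITY = {
    ("take umbrella", "raining"): 1.0, ("take umbrella", "sunny"): 0.6,
    ("go without", "raining"): 0.0,    ("go without", "sunny"): 1.0,
}

def act(beliefs: Beliefs) -> str:
    """Algorithm Z: choose the action with the highest expected utility under the beliefs."""
    actions = {a for (a, _) in UTILITY}
    return max(actions, key=lambda a: sum(p * UTILITY[(a, h)] for h, p in beliefs.items()))

beliefs = update({"raining": 0.5, "sunny": 0.5}, "wet")  # sense data arrives
print(act(beliefs))                                      # -> "take umbrella"
```

The belief representation, the update rule, the action selector, and the utility function are each separate, explicit objects you could analyze in isolation and then compose into claims about the whole system.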
Building on what you said, pre-LLM agent foundations research appears to have made the following assumptions about what advanced AI systems would be like:
Decision-making processes and ontologies are separable. An AI system’s decision process can be isolated and connected to a different world-model, or vice versa (see the sketch after this list).
The decision-making process is human-comprehensible and has a much shorter description length than the ontology.
As AI systems become more powerful, their decision processes approach a theoretically optimal decision theory that can also be succinctly expressed and understood by human researchers.
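To illustrate the first assumption, here is another toy sketch (again purely hypothetical, with made-up world-models and numbers): the decision process is written against a minimal world-model interface, so the same decision code can be reattached to a different ontology without modification.

```python
# Purely illustrative: a toy version of the separability assumption. The decision
# process only assumes a world-model exposes outcome probabilities; the
# world-models and numbers here are invented for the example.
from typing import Protocol


class WorldModel(Protocol):
    def outcomes(self, action: str) -> dict[str, float]:
        """Return P(outcome | action) in this model's ontology."""
        ...


def decide(model: WorldModel, actions: list[str], utility: dict[str, float]) -> str:
    """The separable decision process: expected-utility maximization over any world-model."""
    def expected_utility(action: str) -> float:
        return sum(p * utility[o] for o, p in model.outcomes(action).items())
    return max(actions, key=expected_utility)


class WeatherModel:
    def outcomes(self, action: str) -> dict[str, float]:
        if action == "take umbrella":
            return {"stay dry": 0.9, "get wet": 0.1}
        return {"stay dry": 0.4, "get wet": 0.6}


class TrafficModel:
    def outcomes(self, action: str) -> dict[str, float]:
        if action == "leave early":
            return {"on time": 0.8, "late": 0.2}
        return {"on time": 0.3, "late": 0.7}


# The same decision process, plugged into two different world-models:
print(decide(WeatherModel(), ["take umbrella", "go without"], {"stay dry": 1.0, "get wet": 0.0}))
print(decide(TrafficModel(), ["leave early", "leave late"], {"on time": 1.0, "late": 0.0}))
```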
None of these assumptions ended up being true of LLMs. In an LLM, the world-model and the decision process are mixed together in a single neural network rather than existing as separate components. LLMs don’t come with decision-related concepts like “hypothesis” and “causality” pre-loaded; those concepts are learned over the course of training and are represented in the same messy, polysemantic way as any other learned concept. There is no known way to separate out the reasoning-related features to get a decision process you could plug into a different world-model. And when LLMs are scaled up, their decision-making becomes more complex and inscrutable, since it is distributed across the network; it does not converge to a simple, human-comprehensible decision theory.