AI Interpretability idea: Train an LLM with a “split-brain” architecture. That is, the activations in the intermediate layers are siloed into two groups with only a small amount of interaction between the groups. This could be enforced in the linear layers by constraining the cross-interaction blocks of each weight matrix to be low-rank, or by adding some matrix norm of those blocks to the regularizer. I’m sure this can be generalized from MLP to transformer architecture.
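To make the constraint concrete, here is a minimal NumPy sketch of one such linear layer. Everything in it is an assumption for illustration: the function names `split_brain_weight` and `cross_penalty`, the equal side widths, the rank of 2, and the choice of squared Frobenius norm as the penalty. It only shows the shape of the idea, not a training setup.

```python
import numpy as np

def split_brain_weight(rng, d_left, d_right, rank):
    """Build a square weight matrix of size d_left + d_right whose
    cross-side (off-diagonal) blocks have rank at most `rank`."""
    # Unconstrained blocks: each side talking to itself.
    w_ll = rng.normal(size=(d_left, d_left))
    w_rr = rng.normal(size=(d_right, d_right))
    # Cross blocks, parameterised as a product of thin factors so that
    # their rank can never exceed `rank`.
    w_lr = rng.normal(size=(d_left, rank)) @ rng.normal(size=(rank, d_right))
    w_rl = rng.normal(size=(d_right, rank)) @ rng.normal(size=(rank, d_left))
    return np.block([[w_ll, w_lr], [w_rl, w_rr]])

def cross_penalty(w, d_left):
    """Soft alternative: squared Frobenius norm of the cross blocks,
    to be added to the training loss with some coefficient."""
    upper = w[:d_left, d_left:]   # right side feeding into left side
    lower = w[d_left:, :d_left]   # left side feeding into right side
    return float(np.sum(upper ** 2) + np.sum(lower ** 2))

rng = np.random.default_rng(0)
W = split_brain_weight(rng, d_left=8, d_right=8, rank=2)
x = rng.normal(size=16)
h = np.maximum(W @ x, 0.0)  # one ReLU layer; h[:8] is "left", h[8:] is "right"
```

The hard version (factored cross blocks) guarantees a narrow channel between the sides, while the penalty version lets training decide how much cross-talk is worth paying for.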
Next, apply standard interpretability tools to see how it uses its two sides to represent concepts. Hopefully you’ll find that there are some concepts and tasks that are more focused on one side than on the other.
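One simple way to quantify “more focused on one side” is sketched below. It assumes you already have a direction in activation space for a concept (for example from a linear probe; that choice of tool is my assumption, not part of the proposal), and the name `lateralization` is made up for this illustration.

```python
def lateralization(direction, d_left):
    """Score in [-1, 1]: +1 if the direction lies entirely on the left
    side, -1 if entirely on the right, 0 if its squared norm is evenly
    split between the two sides."""
    left = sum(v * v for v in direction[:d_left])
    right = sum(v * v for v in direction[d_left:])
    total = left + right
    if total == 0:
        raise ValueError("direction must be non-zero")
    return (left - right) / total
```

A concept whose score sits near 0 is a poor test case for the later introspection steps; the interesting ones are those far from 0.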
Next, try to elicit verbal descriptions from the LLM of which concept is represented on which side. Easy mode is to explicitly tell it how it is designed, give some examples of specific concepts primarily encoded on specific sides, even fine-tune it on those examples, and ask it to predict which side encodes concepts you haven’t trained it on or told it about. Hard mode is not to tell it anything about how it was designed but just give it general prompts about introspection and try to get it to figure this out from scratch. Maybe hand it off to some of the LLM whisperers without telling it how it was created.
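Scoring easy mode could look roughly like the sketch below: compare the model’s stated side for each held-out concept with the side measured by the interpretability tools, and check it against the trivial baseline of always guessing the more common side. The function name, the "left"/"right" labels, and the concept names are placeholders, not real results.

```python
from collections import Counter

def score_self_report(truth, reported):
    """Compare the model's verbal side predictions with the measured ones.

    truth, reported: dicts mapping concept name -> "left" or "right".
    Returns (accuracy, majority_baseline) over the concepts in `truth`.
    """
    if not truth:
        raise ValueError("need at least one held-out concept")
    hits = sum(1 for concept, side in truth.items()
               if reported.get(concept) == side)
    # Accuracy of always answering with the most common true side.
    majority = Counter(truth.values()).most_common(1)[0][1]
    n = len(truth)
    return hits / n, majority / n

# Made-up placeholder labels, purely to show the call shape.
truth = {"concept_a": "left", "concept_b": "right",
         "concept_c": "left", "concept_d": "left"}
reported = {"concept_a": "left", "concept_b": "left",
            "concept_c": "left", "concept_d": "right"}
accuracy, baseline = score_self_report(truth, reported)
```

Only accuracy clearly above the baseline on concepts never mentioned in training or prompts would count as evidence of introspective access rather than pattern-matching on the examples.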
Finally, if you succeed in hard mode, give the same prompting to a regular LLM and pay attention to what it says.
The idea is that I’m trying to elicit the LLM to introspect in a way that actually gives us information about the features and concepts it uses to generate text, rather than just tell the story of the humanoid character it is playacting as. I want to create a scenario where there is a non-obvious ground truth that we have access to that it could plausibly have access to and be able to share.
Maybe just use a standard Mixture of Experts architecture and try to get it to tell you which expert it’s using?
I think MIRI’s Logical Inductor idea can be factored into two components, one of which contains the elegant core that is why this idea works so well, and the other is an arbitrary embellishment that obscures what is actually going on. Of course I am calling for this to be recognized, and for people to teach and think only about the elegant core. The elegant core is infinitary markets: Markets that exist for an arbitrarily long time, with commodities that can take arbitrarily long to return dividends, and infinitely many market participants who use every computable strategy. The hack is that the commodities are labeled by sentences in a formal language and the relationships between them are governed by a proof system. This creates a misleading pattern in which the value of the commodity labeled phi appears to measure the probability that phi is true; in fact what it measures is more like the probability that the proof system will eventually affirm that phi is true, or more precisely like the probability that phi is true in a random model of the theory. Of course what we really care about is the probability phi is actually true, meaning true in the standard model where the things labeled “natural numbers” are actual natural numbers and so on. By combining proof systems and infinitary markets, one obscures how much of the “work” in obtaining accurate information is done by each. I think it is better to study these two things separately. Since proof systems are already well-studied and infinitary markets are the novel idea in MIRI’s work, that means they should primarily study infinitary markets.
While I like a lot of Hanson’s grabby alien model, I do not buy the inference that, since humans appeared early in cosmological history, the cosmic commons must be taken quickly, which in turn gives a lower bound on how often grabby aliens appear. I think that neglects the possibility that the early universe is inherently more conducive to creating life, so most life is created early, but these lifeforms may be very far apart.
Question for people who think about corrigibility: Consider a corrigible agent with the goal of coloring a wall red. It is considering two kinds of paint. One paint is a brighter, more red red while the other is a duller red. However, the duller red paint is easier to scrape off to repaint the wall in a different color, while the brighter red is harder to remove or paint over. What is the right choice of paint for the corrigible agent to use? How should a corrigible agent make this decision? What additional information, if any, does the agent need to decide?
Crossposted on my blog:
Lightspeed delays lead to multiple technological singularities.
By Yudkowsky’s classification, I’m assuming the Accelerating Change Singularity: As technology gets better, the characteristic timescale at which technological progress is made becomes shorter, so that the time until this reaches physical limits is short from the perspective of our timescale. At a short enough timescale the lightspeed limit becomes important: When information cannot traverse the diameter of civilization in the time remaining until the singularity, further progress must be made independently in different regions. The subjective time from then may still be large, and without communication the different regions can develop different interests and, after their singularities, compete. As the characteristic timescale becomes shorter, the independent regions split further.
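A back-of-the-envelope version of the splitting argument, as a sketch only: it treats civilization as a ball, treats the distance light covers in the remaining time as the diameter of each independent region, and counts regions by volume ratio. The function name and this crude geometry are my assumptions for illustration.

```python
C = 299_792_458.0  # speed of light in m/s

def independent_regions(diameter_m, time_to_singularity_s):
    """Rough count of regions that cannot exchange a signal before the
    singularity, for a ball-shaped civilization of the given diameter."""
    reach = C * time_to_singularity_s  # farthest a signal gets in the time left
    if reach >= diameter_m:
        return 1  # a signal can still cross the whole civilization
    return (diameter_m / reach) ** 3  # volume ratio, so it scales as 1 / time^3
```

Under these assumptions, halving the remaining time multiplies the number of independent regions by eight, which is the “split further” behavior described above.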