Note to self: If you think you know where your unknown unknowns sit in your ontology, you don’t. That’s what makes them unknown unknowns.
If you think that you have a complete picture of some system, you can still find yourself surprised by unknown unknowns. That’s what makes them unknown unknowns.
If your internal logic has almost complete predictive power, plus or minus a tiny bit of error, your logical system (but mostly not your observations) can still be completely overthrown by unknown unknowns. That’s what makes them unknown unknowns.
You can respect unknown unknowns, but you can’t plan around them. That’s… You get it by now.
Therefore I respectfully submit that anyone who presents me with a foolproof and worked-out plan of the next ten/hundred/thousand/million years has failed to take into account some unknown unknowns.
I could feel myself instinctively disliking this argument, and I think I figured out why.
Even though the argument is obviously true, and it is here used to argue for something I agree with, I’ve historically mostly seen this argument used to argue against things I agree with: specifically, to argue for disregarding experts, and to argue that nuclear power should never be built, no matter how safe it looks. Now this explains my gut reaction, but not whether it’s a good argument.
When thinking it through, my real problem with the argument is the following. While it’s technically true, it doesn’t help locate any useful crux or resolution to a disagreement. It naturally leads to a situation where one party estimates the unknown unknowns to be much larger than the other party does, and that difference is the real crux. To make things worse, often one party doesn’t want to argue for their estimate of the size of the unknown unknowns. But we need to estimate the sizes of unknown unknowns, otherwise I can troll people with “tic-tac-toe will never be solved because of unknown unknowns”.
I therefore feel better about arguments for why unknown unknowns may be large, compared to just arguing for a positive probability of unknown unknowns. For example, society has historically been extremely chaotic when viewed at large time scales, and we have numerous examples of similar past predictions which failed because of unknown unknowns. So I have a tiny prior probability that anyone can accurately predict what society will look like far into the future.
Yeah, definitely. My main gripe is similar to yours: the place where I most often see people disregarding unknown unknowns is in presenting definite, worked-out pictures of the future.
Google potentially adding ads to Gemini:
https://arstechnica.com/ai/2025/05/google-is-quietly-testing-ads-in-ai-chatbots/
OpenAI adds shopping to ChatGPT:
https://www.wired.com/story/openai-adds-shopping-to-chatgpt/
If there’s anything the history of advertising should tell us, it is that powerful optimisation pressures for persuasion will be developed quietly in the background of every future model’s post-training pipeline.
Quietly at first, then openly as people get used to it. You always want to have just slightly more ads than your competitors, because having much more could make people switch.
More bad news about the optimisation pressures on AI companies: ChatGPT now has a product-buying feature
https://www.wired.com/story/openai-adds-shopping-to-chatgpt/
For now they claim that all product recommendations are organic. If you believe this will last, I strongly suggest you review the past twenty years of tech company evolution.
This seems like an interesting paper: https://arxiv.org/pdf/2502.19798
Essentially: use developmental-psychology techniques to get LLMs to develop a more well-rounded, human-friendly persona that involves reflecting on their actions, while gradually escalating the moral difficulty of the dilemmas presented, as a kind of phased training. I see it as a sort of cross between RLHF, CoT, and the recent work on low-example-count fine-tuning, but for moral rather than mathematical intuitions.
The feature that can be named is not the feature. Therefore, it is called the feature.
Here’s a quick mech interp experiment idea:
For a given set of labelled features from an SAE, remove the feature with a given label, then train a new classifier for that label using only the frozen SAE.
So if you had an SAE with 1,000 labelled features, one of which has been identified as “cat”, zero out the cat feature and then train a new linear cat classifier using the remaining features, without modifying the SAE weights. I suspect this new classifier will be just as accurate as, or more accurate than, the original cat feature.
Obviously, this is easiest with features that trigger on a single word or a cluster of related words, so that you can easily generate training labels for the new linear classifier.
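A minimal sketch of what I have in mind, assuming you already have a matrix of SAE feature activations and binary “is this about a cat” labels for a set of contexts (all array names and the train/test split are placeholders, not any particular library’s API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ablated_probe_accuracy(sae_codes, cat_feature_idx, labels, seed=0):
    """Train a linear probe for "cat" on every SAE feature except the one
    already labelled "cat", keeping the SAE itself frozen throughout.

    sae_codes:       (n_samples, n_features) SAE feature activations.
    cat_feature_idx: index of the feature labelled "cat".
    labels:          (n_samples,) binary array, 1 where the context is about cats.
    """
    # Discard the labelled cat feature; the SAE weights are never touched.
    ablated = np.delete(sae_codes, cat_feature_idx, axis=1)

    # Simple random train/test split.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    split = int(0.8 * len(labels))
    train, test = idx[:split], idx[split:]

    # New linear classifier over the remaining features.
    probe = LogisticRegression(max_iter=1000)
    probe.fit(ablated[train], labels[train])
    probe_acc = probe.score(ablated[test], labels[test])

    # Crude baseline: just threshold the original cat feature's activation.
    baseline_preds = sae_codes[test][:, cat_feature_idx] > 0
    baseline_acc = (baseline_preds == labels[test]).mean()

    return probe_acc, baseline_acc
```

The prediction is that `probe_acc` comes out at least as high as `baseline_acc`, i.e. the “cat” information is spread across the remaining features rather than being exclusively captured by the one we named.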
@the gears to ascension thanks for reminding me. I have come to really dislike obscurantism and layers of pointless obfuscation, but explaining also takes effort and so it is easy to fall back on tired tropes and witticisms. I want to set an example that I would prefer to follow.
In lots of philosophical teaching there is the idea that “what is on the surface is not all there is”—famously, the opening of the Dao De Jing reads “The dao that can be spoken of is not the [essence, true nature, underlying law or principle] of the dao, the name that can be named is not the [essence, …] of the name. Namelessness is the birth of everything, to name is to nurture and mother everything. Therefore the [essence, …] of lacking desire is to see the [elegance, art, beauty, wonder, minuscule details] of things, and the [essence, …] of desire is to see things at their extremes. These two are from the same root but called different things, together they are called [understanding, truth, secret, order]. Finding [understanding, …] within [understanding, …] is the key to all manner of [elegance, …].”
Similarly, there are ideas in Buddhism usually expressed something like “the true appearances of things are not true appearances. Therefore, they are called true appearances.” (I cannot quite source this quote; possibly a misinterpretation or mishearing.) The focus here is on some proposed duality between “appearance” and “essence”, which is related to the Platonic concepts of form and ideal. To make it very literal, one could find appropriate Buddhist garments, imitate the motions and speech of monks, and sit for a long time daydreaming every day. Most of us would not consider this “becoming a Buddhist”.
In my view, the interpretation of these phrases is something like: “things that can be spoken of, imitated, or acted out are the product of an inner quality. The quality is the thing that we want to learn or avoid. Therefore, confusing the products of some inner quality with the quality itself is a mistake. One should instead seek to understand the inner quality over mere appearances.” Again, learning the wisdom of a wise judge probably does not involve buying a gavel, practicing your legal Latin, or walking around in long robes.
There are similar ideas behind labelling and naming, where the context of a name is often just as important as the thing that is being named. So the words “I pronounce you man and wife...” can be spoken on a schoolyard or in a church, by a kindergartener or a priest. It is that context that imbues those words with the quality of “an actual marriage pronouncement”, which is important for determining if the speech-act of marrying two people has occurred. What I’m trying to point at here is a transposition of those ideas into the context of labelling neurons, features, etc., where it may be that the context (i.e. the discarded parts) of any given activation carries just as much information as, if not more than, the part we have labelled itself. To be clear, I could very well be wrong in the specific SAE case; I just wanted to flesh out a thought I had.
From Inadequate Equilibria:
Visitor: I take it you didn’t have the stern and upright leaders, what we call the Serious People, who could set an example by donning Velcro shoes themselves?
From Ratatouille:
In many ways, the work of a critic is easy. We risk very little, yet enjoy a position over those who offer up their work and their selves to our judgment. We thrive on negative criticism, which is fun to write and to read. But the bitter truth we critics must face, is that in the grand scheme of things, the average piece of junk is probably more meaningful than our criticism designating it so. But there are times when a critic truly risks something, and that is in the discovery and defense of the new. The world is often unkind to new talent, new creations. The new needs friends.
And that’s why bravery is the secret name of the nameless virtue, and seriously underrated.
[[To elaborate slightly: to go beyond pointing and sneering, to actually work to construct a better future, is very difficult. It requires breaking from social conventions, not just the social conventions you claim are “self-evidently stupid” but also the ones you see as natural and right. In many ways the hardest task is not to realise what the “right choice” is, but to choose to cooperate in the face of your knowledge of Nash equilibria.
To reach for the Pareto-optimal solution to a coordination game means knowing you might very well be stabbed in the back. In a world where inadequate equilibria persist, the only way we get out is to be the first person to break those equilibria, and that requires you to take some pretty locally irrational actions. Sometimes choosing not to defect or to punish requires unimaginable bravery. Mere knowledge of Moloch does not save you from Moloch; only action does.]]
Any chance we can get the option on desktop to use double-click to super-upvote instead of click-and-hold? My trackpad is quite bad, and this typically takes me 3-5 attempts. Double-clicking, by contrast, is much more reliable.
I’ve written a new post about upcoming non-LLM AI developments I am very worried about. This was inspired by the recent release of the Hierarchical Reasoning Model which made some waves on X/Twitter.
I’ve been tracking these developments for the better part of a year now, making predictions privately in my notebook. I also started, and got funding for, a small research project on the AI safety implications of these developments. However, things are now developing extremely quickly. At this point, if I wait for DEFINITIVE proof, it will probably be too late to do anything except brace for impact.
I think I’ve just figured out why decision theories strike me as utterly pointless: they get around the actual hard part of making a decision. In general, decisions are not hard because you are weighing payoffs, but because you are dealing with uncertainty.
To operationalise this: a decision theory usually assumes that you have some number of options, each with some defined payout. Assuming payouts are fixed, all decision theories simply advise you to pick the option with the highest utility. “Difficult problems” in decision theory are problems where the payout is determined by some function that contains a contradiction, which is then resolved by causal/evidential/functional decision theories, each with their own method of cutting the Gordian knot. The classic contradiction, of course, is that “payout(x1) == 100 iff predictor(your_choice) == x1; else payout(x1) == 1000”.
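To render that self-reference explicitly, here is a toy Newcomb-flavoured sketch; the numbers and names are just an illustration of the pseudocode above, not any canonical formulation:

```python
def perfect_predictor(choice: str) -> str:
    # A perfect predictor's prediction always matches your actual choice,
    # so the "else 1000" branch below can never actually be reached.
    return choice

def payout(choice: str, predictor) -> int:
    # The payout of picking x1 depends on the predictor's output, which in
    # turn depends on your choice: this is the self-referential knot that
    # causal/evidential/functional decision theories each cut differently.
    predicted = predictor(choice)
    if choice == "x1":
        return 100 if predicted == "x1" else 1000
    return 50  # payoff of the other option, purely illustrative

print(payout("x1", perfect_predictor))  # 100, never 1000
```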
Except this is not at all what makes real-life decisions hard. If I am planning a business and ever get to the point where I know a function for exactly how much money two different business plans will give me, I’ve already gotten past the hard part of making a business plan. Similarly, if I’m choosing between two doors on a game show, the difficulty is not that the host is a genius superpredictor who will retrocausally change the posterior goat/car distribution, but the simple fact that I do not know what is behind the doors. Almost all decision theories just skip past the part where you resolve uncertainty and gather information, which makes them effectively worthless in real life. Or, worse, they try to make the uncertainty go away: if I have 100 dollars and can donate to a local homeless shelter I know well or try to give it to a malaria-net charity I don’t know a lot about, I can be quite certain the homeless shelter will not misappropriate the funds or mismanage their operation, and much less certain about the faceless malaria charity. This is entirely missing from the standard EA arguments for allocation of funds. Uncertainty matters.
I think unpacking that kind of feeling is valuable, but yeah, it seems like you’ve been assuming we use decision theory to make decisions, when we actually use it as an upper-bound model: to derive principles of decision-making that may be more specific to human decision-making, to anticipate the behavior of idealized agents, or (as with the distinction between CDT and FDT) as an allegory for toxic consequentialism in humans.
The theories typically assume that each choice option has a number of known, mutually exclusive (and jointly exhaustive) possible outcomes. To each outcome the agent assigns a utility and a probability, so uncertainty is in fact modelled insofar as the agent can assign subjective probabilities to those outcomes occurring. The expected utility of an option is then something like the sum, over its outcomes, of each outcome’s probability times its utility.
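In symbols, for an option $a$ with mutually exclusive outcomes $o_i$, this is the standard expected-utility formula:

$$\mathrm{EU}(a) = \sum_i p(o_i \mid a)\, u(o_i)$$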
Other uncertainties are not covered in decision theory. E.g. 1) if you are uncertain what outcomes are possible in the first place, 2) if you are uncertain what utility to assign to a possible outcome, 3) if you are uncertain what probability to assign to a possible outcome.
I assume you are talking about some of the latter uncertainties?
Very quick thought—do evals fall prey to the Good(er) Regulator Theorem?
As AI systems get more and more complicated, the properties we are trying to measure move away from formally verifiable stuff like “can it do two digit arithmetic” and move towards more complex things like “can it output a reasonable root-cause analysis of this bug” or “can it implement this feature”. Evals then must also move away from simple multiple choice questions towards more complex models of tasks and at least partial models of things like computer systems or development environments.
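As a toy illustration of what “at least partial models of things like computer systems” ends up meaning in practice (this is my own made-up example, not a real eval framework; the file name, the agent interface, and the test are all hypothetical):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Environment:
    files: dict[str, str]  # a minimal stand-in for a development environment

def eval_bug_fixed(agent: Callable[[Environment], Environment],
                   env: Environment) -> str:
    """Return "pass" iff the agent's edits make the (simulated) test pass."""
    env_after = agent(env)
    try:
        namespace: dict = {}
        # The eval re-executes the code, i.e. it embeds an executable model
        # of the relevant slice of the world inside itself.
        exec(env_after.files["adder.py"], namespace)
        assert namespace["add"](2, 2) == 4
        return "pass"
    except Exception:
        return "fail"

def fixer(env: Environment) -> Environment:
    # A stand-in "agent" that simply writes the corrected file.
    return Environment(files={"adder.py": "def add(a, b):\n    return a + b\n"})

buggy = Environment(files={"adder.py": "def add(a, b):\n    return a - b\n"})
print(eval_bug_fixed(fixer, buggy))         # pass
print(eval_bug_fixed(lambda e: e, buggy))   # fail
```

The point is that the only way the eval can certify “bug fixed” is by running a partial model of the task environment itself.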
At this point we can start invoking the good regulator theorem and say that the evaluator is a regulator. It wants to produce the outcome “pass” when the joint system formed from the LLM and the world-model has some desired property (“feature has been implemented”, “bug has been fixed”), and the outcome “fail” otherwise. It seems the test environments will need to get more and more realistic to check for features of more and more complex systems. At the limit you have things like Google’s recent focus on creating world models for AI training, which are full physics-style simulations. For those types of physical tasks this actually tends towards a perfectly deterministic model, in the style of the original good regulator theorem.
Going one level up, what we are interested in may be less the properties of the task or world than properties of the AI itself (will this AI harm the end user? is the AI honest?). At that point evals have to encode assumptions about how agents store beliefs, turn beliefs into actions, etc. At the limit this resembles forming a (Gooder Regulator style) partial model of the agent itself from observations of the agent’s actions, such that an agent taking certain actions in an eval reflects the presence of some undesirable internal property in the weights...
https://research.google/blog/deciphering-language-processing-in-the-human-brain-through-llm-representations/
Activations in LLMs are linearly mappable to activations in the human brain. Imo this is strong evidence for the idea that LLMs/NNs in general acquire extremely human-like cognitive patterns, and that the common “shoggoth with a smiley face” meme might just not be accurate.
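“Linearly mappable” here means something like an encoding-model analysis: fit a linear map from the model’s hidden states to recorded neural responses and check how well it predicts held-out data. A rough sketch of that kind of analysis (array names and shapes are placeholders, not the paper’s actual code):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

def linear_mapping_score(llm_acts, brain_acts, seed=0):
    """Fit a ridge regression from LLM activations to brain responses and
    report the mean held-out correlation across recording channels.

    llm_acts:   (n_words, d_model) hidden states, one row per word/token.
    brain_acts: (n_words, n_channels) neural response aligned to each word.
    """
    X_tr, X_te, Y_tr, Y_te = train_test_split(
        llm_acts, brain_acts, test_size=0.2, random_state=seed)

    # Cross-validated ridge regression mapping model features to all channels.
    model = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X_tr, Y_tr)
    Y_hat = model.predict(X_te)

    # Pearson correlation between predicted and measured response, per channel.
    corrs = [np.corrcoef(Y_hat[:, i], Y_te[:, i])[0, 1]
             for i in range(Y_te.shape[1])]
    return float(np.mean(corrs))
```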