On the advice of @adamShimi, I recently read Hasok Chang’s Inventing Temperature. The book is terrific and full of deep ideas, many of which relate in interesting ways to AI safety. What follows are some thoughts on that relationship, from someone who is not an AI safety researcher and only somewhat follows developments there, and who probably got one or two things wrong.
(Definitions: By “operationalizing”, I mean “giving a concept meaning by describing it in terms of measurable or closer-to-measurable operations”, whereas “abstracting” means “removing properties in the description of an object”.)
There has been discussion on LessWrong about the relative value of abstract work on AI safety (e.g., agent foundations) versus concrete work on AI safety (e.g., mechanistic interpretability, prosaic alignment). Proponents of abstract work argue roughly that general mathematical models of AI systems are useful or essential for understanding risks, especially risks from not-yet-existing systems such as superintelligences. Proponents of concrete work argue roughly that safety work is more relevant when it is empirically grounded and subjected to rapid feedback loops. (Note: The abstract-concrete distinction is similar to, but different from, the distinction between applied and basic safety research.)
As someone who has done neither, I think we need both. We need abstract work because we need to build safety mechanisms using generalizable concepts, so that we can be confident that the mechanisms apply to new AI systems and new situations. We need concrete work because we must operationalize the abstract concepts in order to measure them and apply them to actually existing systems. And finally, we need work that connects the abstract concepts to the concrete ones, to check that they are coherent and so that each can justify the other.
Chang writes:
The dichotomy between the abstract and the concrete has been enormously helpful in clarifying my thinking at the earlier stages, but I can now afford to be more sophisticated. What we really have is a continuum, or at least a stepwise sequence, between the most abstract and the most concrete. This means that the operationalization of a very abstract concept can proceed step by step, and so can the building-up of a concept from concrete operations. And it may be beneficial to move only a little bit at a time up and down the ladder of abstraction.
Take for example the concept of (capacity for) corrigibility, i.e., the degree to which an AI system allows itself to be corrected or shut down. The recent alignment faking paper showed that, in experiments, Claude would sometimes “pretend” to change its behavior when it was ostensibly being trained with new alignment criteria, while not actually changing it. That’s an interesting and important result. But (channeling Percy Bridgman’s operationalism) we can only be confident that it applies to the concrete concept of corrigibility measured by the operations used in the experiments; we have no guarantee that it holds for some abstract corrigibility, or when corrigibility is measured using another set of operations or under other circumstances.
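To make the Bridgman point a bit more tangible, here is a purely hypothetical sketch of what one operationalization of corrigibility could look like. Nothing in it corresponds to the operations actually used in the alignment faking paper; the scenario list, the `model_complies` hook, and the metric itself are all invented for illustration.

```python
from typing import Callable, Sequence

def corrigibility_score(
    model_complies: Callable[[str], bool],
    shutdown_scenarios: Sequence[str],
) -> float:
    """One concrete corrigibility: the fraction of scenarios in which the model
    complies with a correction or shutdown request. Change the scenario set, or
    the notion of "complies", and you have defined a different concrete concept,
    with no guarantee that the two agree."""
    complied = sum(model_complies(scenario) for scenario in shutdown_scenarios)
    return complied / len(shutdown_scenarios)
```

A result established with one such metric is, strictly speaking, a result about that metric; whether it transfers to other operationalizations, or to some abstract corrigibility, is a further question.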
An interesting case study discussed in the book is the development of the abstract concept of temperature by Lord Kelvin (the artist formerly known as William Thomson) in collaboration with James Prescott Joule (of conservation of energy fame). Thomson defined his abstract temperature in terms of work and pressure (which were themselves abstract and needed to be operationalized). He based his definition on the Carnot cycle, an idealized process performed by the theoretical Carnot heat engine. The Carnot heat engine was inspired by actual heat engines, but was fully theoretical—there was no physical Carnot heat engine that could be used in experiments. In other words, the operationalization of temperature that Thomson invented using the Carnot cycle was an intermediate step that required further operationalization before Thomson’s abstract temperature could be connected with experimental data. Chang suggests that, while the Carnot engine was never necessary for developing an abstract concept of temperature, it did help Thomson achieve that feat.
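For readers who want the formula, here is the gist of the Carnot-based definition in its later, better-known ratio form. This is the textbook statement of the absolute thermodynamic scale, not necessarily the exact formulation Chang reconstructs from Thomson’s early papers.

```latex
% Absolute temperature defined via the ideal, reversible (Carnot) engine:
% the ratio of two temperatures is *defined* as the ratio of the heat absorbed
% from the hot reservoir to the heat rejected to the cold one,
\frac{T_H}{T_C} \;\equiv\; \frac{Q_H}{Q_C},
% which makes the efficiency of the ideal engine a function of the two
% temperatures alone, independent of the working substance:
\eta \;=\; \frac{W}{Q_H} \;=\; 1 - \frac{T_C}{T_H}.
```

Note that heat and work appear on the right-hand side, which is exactly why the definition still needed further operationalization before it could touch experimental data.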
Ok, back to AI safety. So above I said that, for the whole AI thing to go well, we probably need progress on both abstract and concrete AI safety concepts, as well as work to bridge the two. But where should research effort be spent on the margin?
You may think abstract work is useless because, when it is not trying to operationalize its abstract concepts (or is not close to being able to), it has no error-correcting mechanism: if it is not grounded in any measurable quantities, it can’t be empirically validated. On the other hand, many abstract concepts (such as corrigibility) were formulated before AI systems looked much like they do today, yet they still make sense and are currently being studied in the concrete, even though that concrete work has not yet been connected back to fully abstract concepts.
You may think concrete work is useless because AI changes so quickly that the operations used to measure things today will soon be irrelevant, or more pertinently perhaps, because the superintelligent systems we truly need to align are presumably vastly different from today’s AI systems, in their behavior if not in their architecture. In that way, AI is quite different from temperature. The physical nature of temperature is constant in space and time—if you measure temperature with a specific set of operations (measurement tools and procedures), you would expect the same outcomes regardless of which century or country you do it in—whereas the properties of AI change rapidly over time and across architectures. On the other hand, timelines seem short, such that AGI may share many similarities with today’s AI systems, and it is possible to build abstractions gradually on top of concrete operations.
There is in fact an example from the history of thermometry of extending concrete concepts to new environments without recourse to abstract concepts. In the 18th century, scientists realized that the mercury and air thermometers then in use behaved very differently at very low and very high temperatures, or could not be used there at all because parts of the instruments froze or melted. While they had an intuitive notion that some abstract temperature ought to apply across all degrees of heat or cold, their operationalized temperatures clearly applied only to a limited range. To solve this, they eventually developed different sets of operations for measuring temperature in the extreme ranges. For example, Josiah Wedgwood measured very high temperatures in ovens by baking standardized clay cylinders and measuring how much they had shrunk. These different operations yielded measurements of temperature on different scales, which were then connected by taking measurements on both scales, each with its own operations, in an overlapping range and lining the results up. All this was done without an abstract theory of temperature, and while the resulting scale was not on very solid theoretical ground, it was good enough to provide practical value.
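Here is a minimal sketch of that scale-linking move, in Python with invented numbers. The readings, the linear form of the mapping, and the calibration points are hypothetical stand-ins, not Wedgwood’s actual data or procedure; the point is just that the two operationalized scales get connected through their overlap and then extrapolated.

```python
import numpy as np

# Hypothetical paired readings from the range where both instruments work.
# (Invented numbers, for illustration only.)
mercury_scale = np.array([300.0, 320.0, 340.0, 360.0])  # mercury-thermometer readings
clay_scale = np.array([1.2, 2.9, 4.6, 6.3])             # clay-shrinkage readings (arbitrary units)

# Fit a simple linear map from the clay scale to the mercury scale over the overlap.
slope, intercept = np.polyfit(clay_scale, mercury_scale, deg=1)

def clay_to_mercury(clay_reading: float) -> float:
    """Convert a clay-shrinkage reading to the mercury scale via the fitted map."""
    return slope * clay_reading + intercept

# Extrapolate far beyond the overlap, to readings the mercury thermometer could
# never produce. This extrapolation is the theoretically shaky step mentioned
# above, but historically the linked scale was still practically useful.
print(clay_to_mercury(60.0))
```

Notice that nothing in the sketch appeals to an abstract theory of temperature; the connection is made entirely at the level of operations.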
Of course, the issue with superintelligence is that, because of, e.g., deceptive alignment and gradient hacking, we want trustworthy safety mechanisms and alignment techniques in place well before the system has finished training. That’s why we want to tie those techniques to abstract concepts that we are confident will generalize well. But I have no idea what the appropriate resource allocation is across these different levels of abstraction.[1] Maybe what I want to suggest is that abstract and concrete work are complementary and should strive towards one another. But maybe that’s what people have been doing all along?
[1] The most upvoted dialogue topic on an October 2023 post by Ben Pace was “Prosaic Alignment is currently more important to work on than Agent Foundations work”, which received 40 agree and 32 disagree votes, suggesting that the general opinion on LessWrong at that time was that the current balance was about right, or that prosaic alignment should get some more resources on the margin. ↩︎