If AI turns out to be very useful for cheaply writing formally verified code, what does that do for AI control? We can now require that the untrusted AI produce, along with any code it writes, a spec and a certificate verifying that the code matches the spec.
How bad of a position does this put the untrusted model in, when it’s trying to write code that does bad stuff? Some sub-questions towards answering that question:
How tight are the constraints that the formal verification process, at its AI-driven heights, can put on the code? Seems like once something is formally verified at all, this rules out large classes of vulnerabilities an AI might exploit. Can we push this further to remove almost all attack vectors?
How easy is it to write a spec that looks good to humans, but actually contains a significant omission? (A toy example follows this list.)
How feasible is it for the spec writer to collude with the code writer?
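As a toy illustration of the omission worry, here's a minimal sketch in plain Python using a property-style "spec" rather than an actual proof assistant; all function names are made up for illustration. The incomplete spec only demands that the output be sorted, and forgets to demand that it be a permutation of the input, so a "sort" that throws the data away satisfies it.

```python
# Sketch: a spec that "looks good" but has a significant omission.

def spec_sort_incomplete(inp, out):
    """Looks reasonable at a glance: output must be in ascending order."""
    return all(out[i] <= out[i + 1] for i in range(len(out) - 1))

def spec_sort_complete(inp, out):
    """Adds the clause that was missing: output is a reordering of the input."""
    return spec_sort_incomplete(inp, out) and sorted(inp) == sorted(out)

def backdoored_sort(inp):
    """A malicious 'sort' that silently discards the caller's data."""
    return []

example = [3, 1, 2]
assert spec_sort_incomplete(example, backdoored_sort(example))    # passes!
assert not spec_sort_complete(example, backdoored_sort(example))  # caught
```

A human skimming the incomplete spec could easily sign off on it: the verification is airtight about what the spec says, and silent about what it omits.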
A good ask for frontier AI companies, for avoiding massive concentration of power, might be:
“don’t have critical functions controllable by the CEO alone or any one person alone, and check that this is still the case / check for backdoors periodically”
since this seems both important and likely to be popular.
It might be important for AI strategy to track approximately how many people have daily interactions with AI boyfriends / girlfriends. Or, in more generalized form, how many people place a lot of emotional weight and trust in AIs (& which ones they trust, and on what topics).
This could be a major vector via which AIs influence politics, get followers to do things for them, and generally cross the major barrier between merely having capabilities and actually getting things done in the world. The AIs could be misaligned & scheming, or could be acting as tools of some scheme-y humans, or somewhere in between.
(Here I’m talking about AIs which have many powerful capabilities, but aren’t able to act on the world themselves e.g. via nanotechnology or robot bodies — this might happen for a variety of reasons.)
A Berkeley professor speculates that LLMs are doing something more like “copying the human mind” than “learning from the world.” This seems like it would predict some things we already see (e.g., fuzzily, that they’re “not very creative”), and imply nontrivial things about what we should expect from LLMs in the future, though I’m finding it hard to concretize this.
That is, if LLMs are trained with a simple algorithm and acquire functionality that resembles that of the mind, then their underlying algorithm should also resemble the algorithm by which the mind acquires its functionality. However, there is one very different alternative explanation: instead of acquiring its capabilities by observing the world in the same way as humans, LLMs might acquire their capabilities by observing the human mind and copying its function. Instead of implementing a learning process that can learn how the world works, they implement an incredibly indirect process for scanning human brains to construct a crude copy of human cognitive processes.
This is also discussed here. LLMs seem to do a form of imitation learning (evidence: the plateau in base LLM capabilities roughly at human level), while humans, and animals generally, predict reality more directly. The latter is not a form of imitation and therefore not bounded by the abilities of some imitated system.
LLMs are trained on a human-generated text corpus. Imagine an LLM agent deciding whether or not to be a communist. Seems likely (though not certain) that it would be strongly influenced by the existing human literature on communism, i.e. all the text humans have produced arguing about communism’s pros, cons, and empirical consequences.
Now replace ‘communism’ with ‘plans to take over.’ Humans have also produced a literature on this topic. Shouldn’t we expect that literature to strongly influence LLM-based decisions on whether to take over?
That was the argument I’m more confident in. Here’s one I’m less confident in.
‘The literature’ would seem to have a stronger effect on LLMs than it does on humans. Their knowledge is more crystallized, less abstract, more like “learning to play the persona that does X” than “learning to do X.” So maybe by the time LLM agents have other powerful capabilities, their thinking on whether to take over will still be very ‘stuck in its ways’, i.e. very dependent on ‘traditional’ ideas from the human text corpus. They might not be able to see beyond the abstractions and arguments used in the human discourse, even where those are flawed.
Per both arguments, if you’re training an LLM, you might want to care a lot about what its training data says about the topic of AIs taking over from humans.
Human minds form various abstractions over our environment. These abstractions are sometimes fuzzy (too large to fit into working memory) or leaky (they can fail).
Mathematics is the study of what happens when your abstractions are completely non-fuzzy (always fit in working memory) and completely non-leaky (never fail). And also the study of which abstractions can do that.
Working memory bounds aren’t super related to non-fuzziness, as you can have a window which slides over the context and is still rigorous at every step. Absolute local validity, due to the well-specifiedness of the axioms and rules of inference, is closer to the core.
(Realised you mean that the axioms and rules of inference are in working memory, not the whole tower; retracted.)
Mathematics is the study of what happens when your abstractions are completely non-fuzzy (always fit in working memory)
I don’t think that’s true: every mathematical insight started out as an intuitive guess that may or may not turn out to be wrong, pending the painstaking work of completing the proof. The whole proof does not fit in working memory. It can be learned and recited, but that doesn’t mean it’s all in working memory at the same time.
(This is a brainstorm-type post which I’m not highly confident in, putting out there so I can iterate. Thanks for replying and helping me think about it!)
I don’t mean that the entire proof fits into working memory, but that the abstractions involved in the proof do. Philosophers might work with a concept like “the good”, which has a few properties immediately apparent but other properties available only on further deep thought. Mathematicians work with concepts like “group” or “4”, whose defining properties are immediately apparent, and these are what’s involved in proofs. Call the former fuzzy concepts and the latter non-fuzzy concepts.
(Philosophers often reflect on their concepts, like “the good,” and uncover new important properties, because philosophy is interested in intuitions people have from their daily experience. But math requires clear up-front definitions; if you reflect on your concept and uncover new important properties not logically entailed by the others, you’re supposed to use a new definition.)
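To make the non-fuzzy side concrete, here is a sketch of the “group” abstraction written out in Lean 4 (no libraries; `MyGroup` is an illustrative name, not Mathlib’s `Group`). The point is that the whole definition is a handful of well-specified axioms, and it is this definition, not the proofs later built on top of it, that has to fit in working memory.

```lean
-- The entire "group" abstraction: a few well-specified axioms that fit in
-- working memory, even though proofs built on top of them may not.
structure MyGroup (G : Type) where
  mul     : G → G → G   -- the group operation
  one     : G           -- the identity element
  inv     : G → G       -- inverses
  assoc   : ∀ a b c : G, mul (mul a b) c = mul a (mul b c)
  one_mul : ∀ a : G, mul one a = a
  inv_mul : ∀ a : G, mul (inv a) a = one
```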
Hmm, you’re right, that’s a distinction. I guess I glossed over it because in applied conceptual-engineering fields where plenty of math is done, like code (and maybe physics? or is that more about the fuzziness of the mapping to the physical world) (or maybe even applied math sometimes), there are still lots of situations where the abstraction stops fitting in working memory, because it’s grown too complex for most of the people who work with it to fully understand its definitions.
Also maybe I’m assuming math is gonna get like that too once AI mathematicians start to work? (And I’ve always felt like there should be a lot more automation in math than there is)