Related question: how much do we know about models’ awareness as it pertains to ‘correctness’ or ‘truth-seeking’? This paper showed that there seems to be some sort of ‘good-evil’ axis in models (a massive oversimplification, but directionally correct), onto which even behaviors seemingly unrelated to morality map. In a similar fashion, is it possible to extract a steering vector that treats as true only what is ‘verifiable’? My thinking is that you could use a small-to-medium-sized set of examples to establish the existence of a circuit or vector that specifically fires in cases where the model chose to be lazy or hide the ball to get to “correct” and get rewarded, then try shifting it the opposite way to explore the model’s behavior when it’s not reward hacking. If there’s a similar ‘false-true’ axis in models, you could measure that circuit’s activation while models work in RL environments and detect cases of reward hacking much more easily. Naively I’d expect it to exist, because the degrees of freedom in reward hacking seem much higher than in actually solving the problem; i.e., solving the problem is a specific behavior that could be isolated. We don’t really know the shape of latent space, but I would hope this kind of structure exists.
(I say this as a complete layman; I don’t know if it’s possible.)
(Additionally, perhaps this paper could be considered evidence against such a possibility, but it seems a little too pat. Even if you can’t isolate a single circuit, there might be a cluster or family of activations that you could discover and build up over time. I don’t fully believe autolabeling is mature technology; we could probably come up with a better way to do this.)
As a human, I can generally tell when I’m lying or not trying very hard. That said, people do lie to themselves and occasionally feel they are high-effort when they are actually low-effort. So I’m not sure there’s some kind of neat breakdown, but as a tool for model monitoring, a pipeline that reliably extracts the vectors or circuits for the component qualities which bubble up to “working hard and truthfully on a problem” would be useful to have.
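To make that concrete, here is a minimal sketch of what I have in mind: the standard difference-of-means steering-vector recipe, written against a Llama-style HuggingFace model. Everything here is an assumption on my part; the model name is a placeholder, the layer choice is arbitrary, and the labeled transcripts are stand-ins. It illustrates the technique, not a validated reward-hacking detector.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# All names below are illustrative assumptions, not a tested setup.
MODEL_NAME = "meta-llama/Llama-3.1-8B"  # placeholder; any Llama-style causal LM
LAYER = 16                              # middle layer, chosen arbitrarily

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_resid(texts):
    """Mean residual-stream activation at LAYER, taken at each text's final token."""
    acts = []
    def hook(_module, _inputs, output):
        # Decoder layers return a tensor or a tuple whose first element is
        # the hidden states, depending on the transformers version.
        hs = output[0] if isinstance(output, tuple) else output
        acts.append(hs[:, -1, :].detach())
    handle = model.model.layers[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        for t in texts:
            model(**tok(t, return_tensors="pt"))
    handle.remove()
    return torch.cat(acts).float().mean(dim=0)

# Hypothetical labeled transcripts: cases where the model gamed the grader
# versus cases where it genuinely worked on the task.
hacking_examples = ["<transcript: model special-cases the unit tests>",
                    "<transcript: model hardcodes the expected output>"]
honest_examples  = ["<transcript: model actually solves the problem>",
                    "<transcript: model admits it cannot solve the task>"]

# Difference-of-means direction: a candidate 'reward hacking' axis.
direction = mean_resid(hacking_examples) - mean_resid(honest_examples)
direction = direction / direction.norm()

def hacking_score(rollout_text):
    """Projection onto the candidate axis; high scores get flagged for review."""
    return torch.dot(mean_resid([rollout_text]), direction).item()
```

In principle the same `direction` could be subtracted from the residual stream during generation to test the “shift it the opposite way” idea, and `hacking_score` could be logged per rollout inside an RL loop as a cheap monitoring signal; whether a single linear direction actually captures the behavior is exactly the open question.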
sophont
Notionally speaking, what sort of options exist to outsource deployment and lab skills? If I submitted a dangerous RNA sequence for synthesis (I have no idea what sorts of things would even be considered dangerous), or requested a plasmid designed by a very smart model, could those be synthesized by a third party without safety checks and eventually become a danger?
I don’t really disagree with anything here, but it does seem historically well founded that nation-states only respond to a much higher standard of evidence where dangerous technology is concerned. I think there are a few conditions we need to meet.
Conditions:
A. Clear demonstration of the technology’s great danger—We dropped the bomb before we recognized that the bomb shouldn’t be dropped under any circumstances. Or, in the case of the Montreal Protocol, we had unambiguous evidence that CFCs were depleting the ozone layer.
B. Brinksmanship using the technology in question—It took the Cuban Missile Crisis, for example, to prompt international cooperation on nuclear arms control, starting with the 1963 Partial Test Ban Treaty.
C. Public fear/horror regarding specific uses of the technology—The nuclear examples are obvious. Another case would be the Chemical Weapons Convention, which was prompted in part by news coverage of the 1988 Halabja attack.
D. No win condition—Nuclear proliferation proceeded practically unabated so long as countries viewed it as a race they could win. Von Neumann’s MAD doctrine changed the calculus: any use of nuclear weapons could trigger an exchange that no one could win.
E. Limited controls—Even in the case of nuclear technology, nuclear material is produced and used across the economy. Controls target the specific steps needed to produce nuclear weapons (enrichment, for example) rather than the general use of nuclear material.
Applied with respect to AI:
A. No major events due to frontier AI just yet. Some deaths due to at-risk individuals using Helpful and Harmful AI.
B. No brinksmanship, since no near-peer power has similarly dangerous AI.
C. Mass distaste for AI, but not yet fear/horror. AI hasn’t been used as a weapon on a large scale yet.
D. Policy leaders seem to believe there is a win condition in which they can maintain ‘control’ of superintelligent AI. There is no way to make them believe otherwise, i.e. human hubris as a failure condition.
E. Although limited controls are on the table in the form of privacy and data-use rules, and, more usefully, limits on datacenter size (not yet passed anywhere), safetyists still push for a pause until we have better alignment technology. I agree with this, but it’s an impossible thing to advocate for when none of the previous conditions have been met.
So overall, I doubt we can push for safety regulation targeted at ASI until there is a clear and present danger—a catch-22, since that is exactly what we want to prevent. I would recommend that AI safety reorient around recognizing or predicting when these conditions will be met, and have drafted legislation and countermeasures ready to go the moment they are.
(Pithy comment, not to be taken seriously—unless someone at the frontier labs uses Mythos or the OpenAI equivalent to conduct a mass cyberattack against middle America, it’s unlikely regulation like this will even be on the table.)
I have the same basic prior as you in terms of disbelieving claims of eidetic memory; however, I could certainly believe it about Von Neumann specifically. Even if it’s extremely rare, there are quite a few documented cases of extraordinary memories in history (Kim Peek, Shereshevsky, etc.). Now, if we assess this as a claim that he had perfect recall of 600,000 words after having read them, I probably do agree with you, but the actual mechanics of the claim, i.e. repeating a book back word for word, probably start at the beginning and proceed from there—hence, it reduces to knowing the first few pages of a book really well on a single reading. He apparently learned many languages at a young age, which lends additional evidence towards him having a high verbal intelligence, of which memory is a component.
Thanks for sharing! Very interested in the passive autorecording-to-bugfix pipeline; that seems like a high-utility approach for combating entropy. I’d be interested in similar approaches in non-software fields (some form of automatic pothole recognition, or a lab buddy that pays attention to any disorganization and suggests fixes, etc.).