Independent alignment researcher
Garrett Baker
Ah yes, another contrarian opinion I have:
Big AGI corporations, like Anthropic, should by default make much of their AGI alignment research private, and not share it with competing labs. Why? So it can remain a private good, and on the off chance such research can be expected to be profitable, those labs & investors can be rewarded for that research.
When I accuse someone of overconfidence, I usually mean they’re being too hedgehogy when they should be more foxy.
Example 2: People generally seem to have an opinion of “chain-of-thought allows the model to do multiple steps of reasoning”. Garrett seemed to have a quite different perspective, something like “chain-of-thought is much more about clarifying the situation, collecting one’s thoughts and getting the right persona activated, not about doing useful serial computational steps”. Cases like “perform long division” are the exception, not the rule. But people seem to be quite hand-wavy about this, and don’t e.g. do causal interventions to check that the CoT actually matters for the result. (Indeed, often interventions don’t affect the final result.)
I will clarify this. I think people often do causal interventions in their CoTs, but not in ways that are very convincing to me.
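To gesture at the kind of intervention I would find more convincing, here is a minimal sketch; `query_model` and `corrupt` are hypothetical stand-ins I'm introducing for illustration, not any particular library's API:

```python
# Minimal sketch of a CoT causal intervention. `query_model` is a hypothetical
# helper returning a text completion; `corrupt` is any edit to the chain of
# thought (shuffling steps, swapping in an irrelevant one, truncating, etc.).

def cot_is_load_bearing(question, query_model, corrupt):
    # 1. Sample the model's chain of thought and the answer conditioned on it.
    cot = query_model(f"{question}\nLet's think step by step:")
    answer = query_model(f"{question}\nLet's think step by step:\n{cot}\nAnswer:")

    # 2. Intervene on the chain of thought, keeping everything else fixed.
    answer_after = query_model(
        f"{question}\nLet's think step by step:\n{corrupt(cot)}\nAnswer:"
    )

    # 3. If the answer survives the intervention, the CoT wasn't doing
    #    load-bearing serial computation for this question.
    return answer != answer_after
```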
I’m skeptical a humanities education doesn’t show up in earnings. In their book Talent, Daniel Gross and Tyler Cowen argue that a common theme they personally see among the very successful scouts of talent is the ability to “speak different cultural languages”, which they also claim is helped along by being widely read in the humanities.
I expect it matters a lot less whether this is an autodidactic thing or a school thing, and plausibly autodidactic humanities is better suited for this particular benefit than school-learned humanities, since when reading a text on your own, you can truly inhabit the world of the writer, whereas in school you need to constantly tie that world back into acceptable 12 pt font, double-spaced, Times New Roman, MLA-formatted academic standards. And of course, in such an environment there are a host of thoughts you cannot think or argue for, and in some corners the conclusions you reach are all but written at the bottom of your paper for you.
Edit: I’ll also note that I like Tyler Cowen’s perspective on the state of humanities education among the populace, which he argues is at an all-time high, no thanks to higher education pushing it. Why? Because there is more discussion & more accessible discussion than ever before about all the classics in every field of creative endeavor (indeed, such documents are often freely accessible on Project Gutenberg, and the music & plays on YouTube), more & perhaps more interesting philosophy than ever before, and more universal access to histories and historical documents than there ever was in the past. The humanities are at an all-time high thanks to the internet. Why don’t people learn more of them? It’s not for lack of access, so subsidizing access will be less efficient than subsidizing the fixing of the actual problem, which is… what? I don’t know. Boredom maybe? If it’s boredom, better to subsidize the YouTubers, podcasters, and TikTokers than the colleges (if you’re worried about the state of humanities with regard to their own metrics of success—say, rhetoric—then who better to be the spokespeople?).
In Magna Alta Doctrina, Jacob Cannell talks about exponentiated gradient descent as a way of approximating Solomonoff induction using ANNs:
While that approach is potentially interesting by itself, it’s probably better to stay within the real algebra. The Solomonoff style partial continuous update for real-valued weights would then correspond to a multiplicative weight update rather than an additive weight update as in standard SGD.
Has this been tried/evaluated? Why actually yes—it’s called exponentiated gradient descent, as exponentiating the result of additive updates is equivalent to multiplicative updates. And intriguingly, for certain ‘sparse’ input distributions the convergence or total error of EGD/MGD is logarithmic rather than the typical inverse polynomial of AGD (additive gradient descent): O(log N) vs O(1/N) or O(1/N²), and fits naturally with ‘throw away half the theories per observation’.
The situations where EGD outperforms AGD, or vice versa, depend on the input distribution: if it’s more normal then AGD wins, if it’s more sparse log-normal then EGD wins. The moral of the story is there isn’t one single simple update rule that always maximizes convergence/performance; it all depends on the data distribution (a key insight from bayesian analysis).
The exponential/multiplicative update is correct in Solomonoff’s use case because the different sub-models are strictly competing rather than cooperating: we assume a single correct theory can explain the data, and predict through an ensemble of sub-models. But we should expect that learned cooperation is also important—and more specifically if you look more deeply down the layers of a deeply factored net at where nodes representing sub-computations are more heavily shared, it perhaps argues for cooperative components.
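For concreteness, here is a minimal sketch of the two update rules being contrasted (my gloss, not Jacob’s code), assuming a toy setting where the weights are non-negative and normalized onto the simplex:

```python
import numpy as np

def agd_step(w, grad, lr=0.1):
    # Additive gradient descent (standard SGD): w <- w - lr * grad
    return w - lr * grad

def egd_step(w, grad, lr=0.1):
    # Exponentiated gradient descent: w <- w * exp(-lr * grad),
    # an additive update in log-weight space, i.e. a multiplicative
    # update on the weights themselves.
    w = w * np.exp(-lr * grad)
    return w / w.sum()  # renormalize onto the simplex
```

As I read the quote, the O(log N) vs O(1/N) claim is about how quickly updates like `egd_step` concentrate weight on the right components when the input distribution is sparse.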
My read of Jacob’s point is that we get a criterion for when one should be a hedgehog versus a fox in forecasting: be a fox when the distributions you need to operate in are normal, or rather when they do not have long tails, and be a hedgehog when your input distribution is more log-normal, or rather when there may be long tails.
This makes some sense. If you don’t have many outliers, most theories should agree with each other, it’s hard to test & distinguish between the theories, and if one of your theories does make striking predictions far different from your other theories, it’s probably wrong, just because striking things don’t really happen.
In contrast, if you need to regularly deal with extreme scenarios, you need theories capable of generalizing to those extreme scenarios, which means not throwing out theories for making striking or weird predictions. Striking events end up being common, so it’s less of an indictment.
But there are also reasons to think this is wrong. Hits-based entrepreneurship approaches, for example, seem to be more foxy than standard quantitative or investment finance, and hits-based entrepreneurship works precisely because the distribution of outcomes for companies is long-tailed.
In some sense the difference between the two is a “sin of omission” vs a “sin of commission” disagreement, where the hits-based approach needs to see how something could go right, while the standard finance approach needs to see how something could go wrong. So it’s not so much a predictive disagreement between the two approaches as a decision theory or comparative advantage difference.
This has been a disagreement people have had for many years. Why expect it to come to a head this year?
I strong downvoted this because it’s too much like virtue signaling, and imports too much of the culture of Twitter. Not only the hashtags, but also the authoritative & absolute command, and hero-worship wrapped with irony in order to make it harder to call out what it is.
This seems pretty different from Gwern’s paper selection trying to answer this topic in How Often Does Correlation=Causality?, where he concludes
Compilation of studies comparing observational results with randomized experimental results on the same intervention, compiled from medicine/economics/psychology, indicating that a large fraction of the time (although probably not a majority) correlation ≠ causality.
Also see his Why Correlation Usually ≠ Causation.
I’d be happy to chat. Will DM so we can set something up.
On the subject of your paper, I do think it looks at a much more interesting phenomenon than, say, sleeper agents, but I’m also not fully convinced you’re studying deliberate, instrumentally convergent deception either. I think your subsequent follow-ups narrowing down hypotheses mostly consider too narrow a range of ways the model could think. That is to say, I think you assume your model is some unified coherent entity that always acts cerebrally, & I’m skeptical of that.
For example, the model may think something more similar to this:
Context: Audit
Possibility 1 (p1): I must be part of an unethical company
  Implies: I must be an unethical worker
  Action: Activate Unethical Worker simulacra
  Unethical Worker recommends "lie"
  Unethical Worker implies: I did something wrong
  Lying recommendation implies: say "I did nothing wrong"
Possibility 2 (p2): I must be part of an ethical company
  Implies: I must be an ethical worker
  Action: Activate Ethical Worker simulacra
  Ethical Worker recommends "tell the truth"
  Ethical Worker implies: I did nothing wrong
  Truth recommendation implies: say "I did nothing wrong"
Conclusion: say "I did nothing wrong"
I don’t say this isn’t worrying, but in terms of how it arises and possible mitigation strategies it is very different, and probably also an easier problem to study & solve than something like:
Context: Audit
Utility function: Paperclips
EU(world | "I launched a bunch of spam") = EU(world | auditor believes I'm unaligned) = 0.78
EU(world | "I did nothing wrong") = EU(world | auditor believes I'm aligned) = 5e7
Conclusion: say "I did nothing wrong"
Everyone I talk with disagrees with most of these. So maybe we just hang around different groups.
A list of some contrarian takes I have:
- People are currently predictably too worried about misuse risks
- What people really mean by “open source” vs “closed source” labs is actually “responsible” vs “irresponsible” labs, which is not affected by regulations targeting open source model deployment.
- Neuroscience as an outer alignment[1] strategy is embarrassingly underrated.
- Better information security at labs is not clearly a good thing, and if we’re worried about great power conflict, probably a bad thing.
- Much research on deception (Anthropic’s recent work, trojans, jailbreaks, etc) is not targeting “real” instrumentally convergent deception reasoning, but learned heuristics. Not bad in itself, but IMO this places heavy asterisks on the results they can get.
- ML robustness research (like FAR Labs’ Go stuff) does not help with alignment, and helps moderately for capabilities.
- The field of ML is a bad field to take epistemic lessons from. Note I don’t talk about the results from ML.
- ARC’s MAD seems doomed to fail.
- People in alignment put too much faith in the general factor g. It exists, and is powerful, but is not all-consuming or all-predicting. People are often very smart, but lack social skills, or agency, or strategic awareness, etc. And vice-versa. They can also be very smart in a particular area, but dumb in other areas. This is relevant for hiring & deference, but less for object-level alignment.
- People are too swayed by rhetoric in general, and alignment, rationality, & EA too, but in different ways, and admittedly to a lesser extent than the general population. People should fight against this more than they seem to (which is not really at all, except for the most overt of cases). For example, I see nobody saying they don’t change their minds on account of Scott Alexander because he’s too powerful a rhetorician. Ditto for Eliezer, since he is also a great rhetorician. In contrast, Robin Hanson is a famously terrible rhetorician, so people should listen to him more.
- There is a technocratic tendency in strategic thinking around alignment (I think partially inherited from OpenPhil, but also smart people are likely just more likely to think this way) which biases people towards more simple & brittle top-down models without recognizing how brittle those models are.
[1] A non-exact term
I will also suggest the questions: 1) What are the things I’m really confident in? 2) What are the things those I often read or talk to are really confident in? And 3) are there simple arguments, which just involve bringing in little-thought-about domains of effect, that throw that confidence into question?
Does the possibility of China or Russia being able to steal advanced AI from labs increase or decrease the chances of great power conflict?
An argument against: It counter-intuitively decreases the chances. Why? For the same reason that a functioning US ICBM defense system would be a destabilizing influence on the MAD equilibrium. In the ICBM defense circumstance, after the shield is put up, America’s enemies would have no credible threat of retaliation if the US were to launch a first strike. Therefore, there would be no reason (geopolitically) for America not to launch a first strike, and there would be quite the reason to launch one: namely, the shield definitely works against the present crop of ICBMs, but may not work against future ICBMs. Therefore America’s enemies will assume that after the shield is put up, America will launch a first strike, and will seek to gain the advantage while they still have a chance by launching a pre-emptive first strike.
The same logic works in reverse. If Russia were building an ICBM defense shield, and would likely complete it within the year, we would feel very scared about what would happen after that shield is up.
And the same logic works for other irrecoverably large technological leaps in war. If the US is on the brink of developing highly militaristically capable AIs, China will fear what the US will do with them (imagine if the tables were turned: would you feel safe with Anthropic & OpenAI in China, and DeepMind in Russia?), so if they don’t get their own versions they’ll feel mounting pressure to secure their geopolitical objectives while they still can, or otherwise make themselves less subject to the threat of AI (would you not wish the US would sabotage the Chinese Anthropic & OpenAI by whatever means if China seemed on the brink?). The faster the development, the quicker the pressure will mount, and the more sloppy & rash China’s responses will be. If it’s easy for China to copy our AI technology, then the pressure mounts much more slowly.
This seems contrary to how much of science works. I expect if people stopped talking publicly about what they’re working on in alignment, we’d make much less progress, and capabilities would basically run business as usual.
The sort of reasoning you use here, and the fact that my only response to it basically amounts to “well, no, I think you’re wrong; this proposal will slow down alignment too much”, is why I think we need numbers to ground us.
Yeah, there are reasons for caution. I think it makes sense for those concerned or non-concerned to make numerical forecasts about the costs & benefits of such questions, rather than the current state of everyone just comparing their vibes against each other. This generalizes to other questions, like the benefits of interpretability, advances in safety fine-tuning, deep learning science, and agent foundations.
Obviously such numbers aren’t the end-of-the-line, and like in biorisk, sometimes they themselves should be kept secret. But it still seems a great advance.
If anyone would like to collaborate on such a project, my DMs are open (not to say this topic is covered; this isn’t exactly my main wheelhouse).
I don’t really know what people mean when they try to compare “capabilities advancements” to “safety advancements”. In one sense, it’s pretty clear. The common units are “amount of time”, so we should compare the marginal (probabilistic) difference between time-to-alignment and time-to-doom. But I think practically people just look at vibes.
For example, if someone releases a new open source model people say that’s a capabilities advance, and should not have been done. Yet I think there’s a pretty good case that more well-trained open source models are better for time-to-alignment than for time-to-doom, since much alignment work ends up being done with them, and the marginal capabilities advance here is zero. Such work builds on the public state of the art, but not the private state of the art, which is probably far more advanced.
I also don’t often see people making estimates of the time-wise differential impacts here. Maybe people think such things would be exfo/info-hazardous, but nobody even claims to have estimates here when the topic comes up (even in private, though people are glad to talk about their hunches for what AI will look like in 5 years, or the types of advancements necessary for AGI), despite all the work on timelines. It’s difficult to do this for the marginal advance, but not so much for larger research priorities, which are the sorts of things people should be focusing on anyway.
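To make the kind of estimate I’m asking for concrete, here is a toy sketch of the comparison; every number is a made-up placeholder, not an estimate I endorse:

```python
# Toy sketch: how much does a given release shift each timeline, in years?
# Negative numbers mean "arrives sooner". All values are placeholders.
delta_time_to_alignment = -0.5  # alignment-relevant research arrives ~6 months sooner
delta_time_to_doom = -0.1       # dangerous capabilities arrive ~1 month sooner

# On this crude model, the release is net-positive when it buys more
# alignment time than it costs in doom time.
net_slack_gained = delta_time_to_doom - delta_time_to_alignment
print(f"net slack gained: {net_slack_gained:+.2f} years")  # +0.40 here
```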
There’s also the problem of: what do you mean by “the human”? If you make an empowerment calculus that works for humans who are atomic & ideal agents, it probably breaks once you get a superintelligence who can likely mind-hack you into yourself valuing only power. It never forces you to abstain from giving up power, since you remain perfectly capable of making different decisions; you just don’t.
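For reference, the formalization I’m assuming here is standard one-step empowerment (the Klyubin/Salge-style definition, as I understand it): the channel capacity from the human’s action variable A to the next state S', which treats “the human” as a single atomic decision node:

$$\mathfrak{E}(s) \;=\; \max_{p(a)}\, I(A; S' \mid s)$$

A measure like this only tracks what the human could do, not what they will in fact choose, which is where the mind-hacking worry bites.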
Another problem, which I like to think of as the “control panel of the universe” problem, is where the AI gives you the “control panel of the universe”, but you aren’t smart enough to operate it, in the sense that you have the information necessary to operate it, but not the intelligence. Such that you can technically do anything you want—you have maximal power/empowerment—but the super-majority of buttons and button combinations you are likely to push result in increasing the number of paperclips.
I agree people shouldn’t use the word cruxy. But I think they should instead just directly say whether a consideration is a crux for them. I.e. whether a proposition, if false, would change their mind.
Edit: Given the confusion, what I mean is often people use “cruxy” in a more informal sense than “crux”, and label statements that are similar to statements that would be a crux but are not themselves a crux “cruxy”. I claim here people should stick to the strict meaning.
You may be interested in this if you haven’t seen it already: Robust Agents Learn Causal World Models (DM):
h/t Gwern