Independent alignment researcher
Garrett Baker
I agree with Conjecture’s reply that this reads more like a hitpiece than an even-handed evaluation.
I don’t think your recommendations follow from your observations, and such strong claims surely don’t follow from the actual evidence you provide. I feel like your criticisms can be summarized as the following:
1. Conjecture was publishing unfinished research directions for a while.
2. Conjecture does not publicly share details of their current CoEm research direction, and that research direction seems hard.
3. Conjecture told the government they were AI safety experts.
4. Some people (who?) say Conjecture’s governance outreach may be net-negative and upsetting to politicians.
5. Conjecture’s CEO Connor used to work on capabilities.
6. One time during college Connor said that he replicated GPT-2, then found out he had a bug in his code.
7. Connor has said at some times that open source models were good for alignment, then changed his mind.
8. Conjecture’s infohazard policy can be overturned by Connor or their owners.
9. They’re trying to scale when it is common wisdom for startups to try to stay small.
10. It is unclear how they will balance profit and altruistic motives.
11. Sometimes you talk with people (who?) and they say they’ve had bad interactions with Conjecture staff or leadership when trying to tell them what they’re doing wrong.
12. Conjecture seems like they don’t talk with ML people.
I’m actually curious about why they’re doing 9, and further discussion on 10 and 8. But I don’t think any of the other points matter, at least to the depth you’ve covered them here, and I don’t know why you’re spending so much time on stuff that doesn’t matter or that you can’t support. This could have been so much better if you had taken the research time spent on everything that wasn’t 8, 9, or 10, used it to do analyses of 8, 9, and 10, and then actually had a conversation with Conjecture about your disagreements with them.
I especially don’t think your arguments support your suggestions:
1. Don’t work at Conjecture.
2. Conjecture should be more cautious when talking to media, because Connor seems unilateralist.
3. Conjecture should not receive more funding until they reach levels of organizational competence similar to OpenAI’s or Anthropic’s.
4. Rethink whether or not you want to support Conjecture’s work non-monetarily. For example, maybe think about not inviting them to table at EAG career fairs, not inviting Conjecture employees to events or workspaces, and not taking money from them if doing field-building.
(1) seems like a pretty strong claim, which is left unsupported. I know of many people who would be excited to work at Conjecture, and I don’t think your points support the claim that they would be doing net-negative research, given that they’d be doing alignment at Conjecture.
For (2), I don’t know why you’re saying Connor is unilateralist. Are you saying this because he used to work on capabilities?
(3) is just absurd! OpenAI will perhaps be the most destructive organization to date. I do not think your above arguments make the case that they are less organizationally responsible than OpenAI. Even having an infohazard policy puts them leagues above both OpenAI and Anthropic in my book. And add onto that that their primary way of getting funded isn’t building extremely large models… In what way do Anthropic or OpenAI have better corporate governance structures than Conjecture?
(4) is just… what? Ok, I’ve thought about it, and come to the conclusion this makes no sense given your previous arguments. Maybe there’s a case to be made here. If they are less organizationally competent than OpenAI, then yeah, you probably don’t want to support their work. This seems pretty unlikely to me though! And you definitely don’t provide anything close to the level of analysis needed to elevate such hypotheses.
Edit: I will add to my note on (2): In most news articles in which I see Connor or Conjecture mentioned, I feel glad he talked to the relevant reporter, and think he/Conjecture made that article better. It is quite an achievement in my book to have sane conversations with reporters about this type of stuff! So mostly I think they should continue doing what they’re doing.
I’m not myself an expert on PR (I’m skeptical that anyone is), so maybe my impressions of the articles are naive and backwards in some way. If you think this is important, it would likely be good to explain somewhere why you think their media outreach is net-negative, ideally pointing to particular things you think they did wrong rather than making vague & menacing criticisms of unilateralism.
I like this post, with one exception: I don’t think putting out fires feels like putting out fires. I think it feels like being utterly confused, and, when you explain the confusion & people try to resolve it but you don’t understand them, continuing to actively notice & chase the confusion instead of nodding your head, no matter how much people decrease your status because you can’t understand what they’re saying. It feels far more similar to going to school wearing a clown suit than to heroically putting out obvious-to-you fires.
We should expect something similar to this fiasco to happen if/when Anthropic’s responsible scaling policies tell them to stop scaling.
> Outside the three major AGI labs, I’m reasonably confident no major organization is following a solid roadmap to AGI; no-one else woke up. A few LARPers, maybe, who’d utter “we’re working on AGI” because that’s trendy now. But nobody who has a gears-level model of the path there, and what its endpoint entails.
This seems pretty false. In terms of large players, there are also Meta and Inflection AI. There are also many other smaller players who care about AGI, and no doubt many AGI-motivated workers at the three labs mentioned would start their own orgs if the org they’re currently working under shut down.
A list of some contrarian takes I have:
- People are currently predictably too worried about misuse risks.
- What people really mean by “open source” vs “closed source” labs is actually “responsible” vs “irresponsible” labs, which is not affected by regulations targeting open source model deployment.
- Neuroscience as an outer alignment[1] strategy is embarrassingly underrated.
- Better information security at labs is not clearly a good thing, and if we’re worried about great power conflict, probably a bad thing.
- Much research on deception (Anthropic’s recent work, trojans, jailbreaks, etc.) is not targeting “real” instrumentally convergent deception reasoning, but learned heuristics. Not bad in itself, but IMO this places heavy asterisks on the results they can get.
- ML robustness research (like FAR Labs’ Go stuff) does not help with alignment, and helps moderately with capabilities.
- The field of ML is a bad field to take epistemic lessons from. Note I don’t talk about the results from ML.
- ARC’s MAD seems doomed to fail.
- People in alignment put too much faith in the general factor g. It exists, and is powerful, but is not all-consuming or all-predicting. People are often very smart, but lack social skills, or agency, or strategic awareness, etc. And vice-versa. They can also be very smart in a particular area, but dumb in other areas. This is relevant for hiring & deference, but less for object-level alignment.
- People are too swayed by rhetoric in general, and alignment, rationality, & EA too, but in different ways, and admittedly to a lesser extent than the general population. People should fight against this more than they seem to (which is not really at all, except for the most overt of cases). For example, I see nobody saying they don’t change their minds on account of Scott Alexander because he’s too powerful a rhetorician. Ditto for Eliezer, since he is also a great rhetorician. In contrast, Robin Hanson is a famously terrible rhetorician, so people should listen to him more.
- There is a technocratic tendency in strategic thinking around alignment (I think partially inherited from OpenPhil, but also smart people are likely just more likely to think this way) which biases people towards more simple & brittle top-down models without recognizing how brittle those models are.
[1] A non-exact term.
What are the actual rationality concepts LWers are basically required to understand to participate in most discussions?
My prior is that this bar should be set pretty high, like 80–100% of the level of the sequences. I remember that years ago, when I finished the sequences, I spent several months practicing everyday rationality in isolation, and only then deigned to visit LessWrong and talk to other rationalists. I was pretty disappointed with the average quality level, and felt like I dodged a bullet by spending those months thinking alone rather than with the wider community.
It also seems like average quality has decreased over the years.
Predictable confusion some will have: I’m talking about average quality here. Not 90th percentile quality posters.
Can you list a concrete research path which you’re pursuing in light of this strategy? This all sounds ok in principle, but I’d bet alignment problems show up in concrete pathways.
Good AGI-notkilleveryoneism-conscious researchers should in general prioritize working at big AGI labs over working independently, at alignment-focused labs, or in academia, marginally more than they currently do.
The ratio of good alignment work done at labs vs independently mostly skews toward labs
Good meaning something different from impactful here. Obviously AGI labs will pay more attention to their own researchers, or to researchers from respectable institutions, than to independent researchers. Your answer should factor out such considerations.
Edit: Also normalize for quantity of researchers.
I want people to not discuss things in DMs, and discuss things publicly more. I also don’t think this is embarrassing for Quintin, or at all a public spectacle.
> It’s an easy mistake to make: both things are called “AI”, after all. But you wouldn’t study manually-written FPS bots circa 2000s, or MNIST-classifier CNNs circa 2010s, and claim that your findings generalize to how LLMs circa 2020s work. By the same token, LLM findings do not necessarily generalize to AGI.
My understanding is that many of those studying MNIST-classifier CNNs circa 2010 were in fact studying them because they believed similar neural-net-inspired mechanisms would go much further, and they would not be surprised if very similar mechanisms were at play inside LLMs. And they were correct! Such studies led to ReLU, backpropagation, residual connections, autoencoders for generative AI, and ultimately the scaling laws we see today.
If you traveled back to 2010, and you had to choose between already extant fields, having that year’s GPU compute prices and software packages, what would you study to learn about LLMs? Probably neural networks in general, both NLP and image classification. My understanding is there was & is much cross-pollination between the two.
Of course, maybe this is just a misunderstanding of history on my part. Interested to hear if my understanding’s wrong!
To my ears it sounded like Shane’s solution to “alignment” was to make the models more consequentialist. I really don’t think he appreciates most of the difficulty and traps of the problems here. This type of thinking, on my model of their models, should make even alignment optimists unimpressed, since much of the reason for optimism lies in observing current language models, and interpreting their outputs as being nonconsequentialist, corrigible, and limited in scope yet broad in application.
This was a really great Petrov day thing-to-do. I clicked on the link basically instinctively, after having a panic about being under time pressure, and about having my preferred outcome not taken, and therefore not being able to think so well about what my actions would actually do. Immediately after clicking, I felt very good, and like I had made the right decision, and then I felt a sinking sensation in my gut as I realized that if I had not taken any action, and nobody else had taken any action, then in a year I’d get to see what LessWrong as a collective decided, and I realized that actually I really wanted to see the result of that Petrov day, even if it meant Petrov day wouldn’t be about preventing the end of the world.
For the rest of that day, and into today, I could not be in the world where LessWrong both collectively voted on what they wanted, and collectively refused to overrule the majority. It felt like I had ruined Petrov day. Even if others had still clicked the link in my stead, I still didn’t want to have been among the people who ruined Petrov day, and it would feel even worse if I was the first, which I place nontrivial probability on (according to Firefox’s history, and my LessWrong message history, I received the message at 6:49 PM, and closed the tab the link opened at 6:51 PM, so it does indeed seem like I was the first one to click the link).
I’ve been feeling really terrible about this since doing it, imagining how sad people will feel when they see that someone has clicked this year’s instantiation of the button.
While feeling bad about it, I could also feel my mind trying to come up with excuses for why it isn’t so bad. The following are interesting rationalizations I’ve come up with:
1. Unilateralism to save the world is in the Petrov day spirit. Petrov was in fact a unilateralist: he decided, contrary to his legitimate government, that he would overrule his country and refuse to report incoming missiles.
2. The LessWrong team are the ones ultimately in charge of LessWrong. They set up a democratic process with a unilateralist element. Their word on their justifications should be taken literally: if they believe allowing for the option to overrule the majority will better achieve the kind of aggregation of will they’re aiming for to decide next year’s virtue, then I should use the special power they gave me in the way they described wanting me to use it.
But both fell when I just imagined the world I created, and felt sad because I wouldn’t see the Petrov day decided by the communal process that I liked better than my own personal decisions.
My mind went through other processes, among them:
- attempting to blame the LessWrong team for giving me this choice (to be clear: I do not blame them, and actually thank them),
- wanting to swear off my own ability to make unilateralist decisions on anything forever & always get a second opinion on anything I could do that significantly affects anyone’s life (seems overkill, but lesson learned about the kind of situations my mind is bad at thinking under!),
- trying to blame things on the fact that this year’s button was not a button but a poorly labeled link (true buttons are not universally labeled as such, and I did have the hint that it was Petrov day, which future people won’t. Also, the link wasn’t so poorly labeled),
- and blaming my own terrible reading comprehension (situationally induced panic that I should have somehow seen ahead of time is a less self-serving label for the situation than generally being bad at reading comprehension in all situations).
Probably this was the most emotional experience I’ve felt around LessWrong’s Petrov day, and I learned a lot about a particular state my mind can get into in which I should not take actions, and about how the world may likely end. Perhaps a similar circumstance will arise one day, where three different people across the world have the option to click their own links and buttons, but not enough space or time to make their minds actually think about the actual worlds they’re making possible or impossible. Or maybe two of them can actually think, but the third has the kind of mind that panics in that particular way under that particular pressure, who then clicks their button, and maybe lives long enough to regret their decision once they get to think about it.
Some of the experience does get taken away knowing my group was not in fact in the minority, and there were no absolute plans to change next year’s Petrov day plans.
This all sounds bad, but I really did love the experience. And you don’t need to worry that you distracted me from work I could have otherwise done! I worked double-time in order to try to distract myself from the issue!
I’ve noticed a similar neglect of obvious first order effects too. Intuitively the signalling and counter-signalling explanations seem the most likely to be correct to me.
This reminds me of the following popular meme template:
In this template you have very smart people and very dumb people coming to the same conclusion by using similar simple arguments, while the moderately intelligent come up with complex, clever, but ultimately wrong arguments for why the obvious policy choice is incorrect. Here, I’ve made a version with the very dumb & very smart arguing that death is bad using the obvious argument that “it kills people”, while the moderately intelligent person argues that death is good because of various commonly argued second-order pro-death arguments.
> They’re not going to produce stellar scientific discoveries where they autonomously invent whole new fields or revolutionize technology.
I disagree with this, and I think you should too, even considering your own views. For example, DeepMind recently discovered 2.2 million new crystals, increasing the number of stable crystals we know about by an order of magnitude. Perhaps you don’t think this is revolutionary, but 5, 10, 15, 50 more papers like it? One of them is bound to be revolutionary.
Maybe you don’t think this is autonomous enough for you. After all, it’s people writing the paper, people who will come up with ideas for what to use the materials for, and people who built this very particular ML setup in the first place. But then your prediction becomes that these tasks will not be automatable by LLMs without making them dangerous. To me these tasks seem pretty basic: likely beyond current LLM abilities, but GPT-5 or 6? Not out of the question given no major architecture or training changes.
(edit note: last sentence was edited in)
A strange effect: I’m using a GPU in Russia right now, which doesn’t have access to Copilot, and so when I’m in VSCode I sometimes pause, expecting Copilot to write stuff for me, and then when it doesn’t I feel a brief flash of the same kind of sadness I feel when a close friend is far away & I miss them.
They could have:
1. Not developed it for Microsoft.
2. Developed it for Microsoft, but insisted on proper safety.
3. Not signed up for whatever deal would allow Microsoft to force it to not do 1 or 2 without sufficient alignment checks.
Suppose our model has the following format:
where are matrix multiplies, and is our nonlinear layer.
We also define a sparsity measure to minimize, chosen for the fun property that it really really really likes zeros compared to almost all other numbers.
Note that lower sparsity according to this measure means more zeros.
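(For intuition, and purely as an illustrative stand-in rather than the exact measure: something like $S(x) = \sum_i \log(1 + x_i^2/\epsilon^2)$ with a small $\epsilon$ has this flavor, since it is nearly flat across most magnitudes but drops steeply as an entry approaches exactly zero, so minimizing it rewards exact zeros far more than merely-small values.)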
There are two reasonable ways of finding the right rotations. I will describe one way in depth, and the other way not so in-depth. Do note that the specifics of all this may change once I run a few experiments to determine whether there are any shortcuts I’m able to take[1].
We know the input is in a preferred basis. In our MNIST case, it is just the pixels on the screen. These likely interact locally because the relevant first-level features are local. If you want to find a line in the bottom right, you don’t care about the existence of white pixels in the top left.
We choose our first rotation so as to minimize
where
Then the second rotation so as to minimize
where
And finally choosing so as to minimize
where
.
The other way of doing this is to suppose the output is in a preferred basis, instead of the input.
Currently I’m doing this minimization using gradient descent (`lr = 0.0001`), and parameterizing my rotation matrices using the fact that if $A$ is an antisymmetric matrix[2], then $e^A$ is a rotation matrix, and that you can make an antisymmetric matrix by choosing any old matrix $M$, then doing $A = M - M^\top$. So we just figure out which $M$ gets us an $e^{M - M^\top}$ which has the properties we like. There is probably a far, far better way of solving this, other than gradient descent. If you are interested in the specifics, you may know a better way. Please, please tell me a better way!
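For concreteness, here is a minimal sketch of that loop in PyTorch. Everything sized or named below (the width `d`, the single weight matrix `W`, the particular zero-loving `sparsity` function) is an illustrative stand-in rather than the real setup; only the antisymmetric-matrix-exponential parameterization and plain gradient descent at `lr = 0.0001` come from the description above.

```python
import torch

# Illustrative stand-ins; the real setup rotates each layer of the MNIST model.
d = 64                                      # stand-in width
W = torch.randn(d, d)                       # a weight matrix to view in a sparser basis
M = torch.zeros(d, d, requires_grad=True)   # unconstrained parameter we optimize

def rotation(M: torch.Tensor) -> torch.Tensor:
    # e^A is a rotation matrix whenever A is antisymmetric (A^T = -A).
    A = M - M.T
    return torch.linalg.matrix_exp(A)

def sparsity(x: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    # Stand-in zero-loving measure: nearly flat for most magnitudes,
    # sharply lower for entries that are (almost) exactly zero.
    return torch.log1p(x.pow(2) / eps**2).sum()

opt = torch.optim.SGD([M], lr=1e-4)         # lr = 0.0001, as above
for _ in range(10_000):
    opt.zero_grad()
    loss = sparsity(rotation(M) @ W)        # sparsity of the rotated weight matrix
    loss.backward()
    opt.step()
```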
[1] An example of a shortcut: I really don’t want to find a rotation which minimizes average sparsity across every input directly. This sounds very computationally expensive! Does minimizing my sparsity metric on a particular input, or only a few inputs, generalize to minimizing the sparsity metric on many inputs?
[2] Meaning it’s like a symmetric matrix, except its top-right half is the opposite sign of its bottom-left half (i.e. $A^\top = -A$).
These all seem like reasonable heuristics for a first-pass judgement of a field, but once you’ve actually engaged with the arguments, you should either be finding actual, real disagreements about assumptions or argument validity, or not; and if not, you should update a bunch towards ‘it looks like they may be right’. You can’t expect everyone to actually engage with your field, but I will certainly ask people to, and will also continue to get annoyed when the above arguments are offered as slam-dunks which people think can substitute for actually engaging with advocates.
Good on you for collecting them here.
We should expect something similar to this fiasco to happen if/when Anthropic’s oversight board tries to significantly exercise their powers.