AI grantmaking at Open Philanthropy.
I used to give careers advice for 80,000 hours.
Task: Identify key background knowledge required to understand a concept
Context: Many people are currently self-directing their learning in order to eventually be able to usefully contribute to alignment research. Even experienced researchers will sometimes come across concepts that require background they don't have in order to understand. By 'key' background content, I'm imagining that the things which get identified are 'one step back' in the chain, or something like 'the required background concepts which themselves require the most background'. This seems like the best way of making the tool useful: if the background concepts generated are themselves not understood by the user, they can just use the tool again on those concepts.
Input type: A paper (with the idea that part of the task is to identify the highest level concepts in the paper). It would also be reasonable to just have the name of a concept, with a separate task of ‘generate the highest level concept’.
Output type: At minimum, a list of concepts which are key background. Better would be a list of these concepts plus summaries of papers/textbooks/wikipedia entries which explain them.
Info considerations: This system is not biased towards alignment over capabilities, though I think it will in practice help alignment work more than capabilities work, due to the former being less well-served by mainstream educational material and courses. This does mean that having scraped LW and the Alignment Forum, alignment-relevant things on arXiv, MIRI's site, etc. would be particularly useful.
I don’t have capacity today to generate instances, though I plan to come back and do so. I’m happy to share credit if someone else jumps in first and does so though!
Thanks for writing this up! I’ve found this frame to be a really useful way of thinking about GPT-like models since first discussing it.
In terms of future work, I was surprised to see the apparent low priority of discussing pre-trained simulators that were then modified by RLHF (buried in the 'other methods' section of 'Novel methods of process/agent specification'). Please consider this comment a vote for you to write more on this! Discussion seems especially important given e.g. OpenAI's current plans. My understanding is that Conjecture is overall very negative on RLHF, but that makes it seem more useful to discuss how to model the results of the approach, not less, to the extent that you expect this framing to help shed light on what might go wrong.
It feels like there are a few different ways you could sketch out how you might expect this kind of training to go. Quick, clearly non-exhaustive thoughts below:
Something that seems relatively benign/unexciting—fine tuning increases the likelihood that particular simulacra are instantiated for a variety of different prompts, but doesn’t really change which simulacra are accessible to the simulator.
More worrying things—particular simulacra becoming more capable/agentic, simulacra unifying/trading, the simulator framing breaking down in some way.
Things which could go either way and seem very high stakes—the first example that comes to mind is fine-tuning causing an explicit representation of the reward signal to appear in the simulator, meaning that both corrigibly aligned and deceptively aligned simulacra are possible, and working out how to instantiate only the former becomes kind of the whole game.
[crossposting my comment from the EA forum as I expect it’s also worth discussing here]
whether you have a 5-10 year timeline or a 15-20 year timeline
Something that I'd like this post to address that it doesn't is that having "a timeline" rather than a distribution seems ~indefensible given the amount of uncertainty involved. People quote medians (or modes, and it's not clear to me that they reliably differentiate between these) ostensibly as a shorthand for their entire distribution, but then discussion proceeds based only on the point estimates.
I think a shift of 2 years in the median of your distribution looks like a shift of only a few % in your P(AGI by 20XX) numbers for all 20XX, and that means discussion of what people who "have different timelines" should do is usually better framed as "what strategies will turn out to have been helpful if AGI arrives in 2030".
While this doesn't make discussion like this post useless, I don't think this is a minor nitpick. I'm extremely worried by "plays for variance", some of which are briefly mentioned above (though far from the worst I've heard). I think these tend to look good only on worldviews which are extremely overconfident and treat timelines as point estimates/extremely sharp peaks. More balanced views, even those with a median much sooner than mine, should typically realise that the EV gained in the worlds where things move quickly is not worth the expected cost in worlds where they don't. This is in addition to the usual points about co-operative behaviour when uncertain about the state of the world, adverse selection, the unilateralist's curse etc.
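To make the point concrete, here's a rough sketch. It assumes, purely for illustration, a lognormal distribution over arrival year; the width `sigma=0.5` and the median years are arbitrary illustrative numbers, not anyone's actual views. A two-year shift in the median moves P(AGI by 2030) by only a couple of percentage points:

```python
import math

def p_agi_by(year, median_year, sigma=0.5, now=2023):
    """P(AGI by `year`) under a hypothetical lognormal over years from now.

    `sigma` is the log-space spread; 0.5 is an arbitrary illustrative
    width, not a claim about anyone's real distribution.
    """
    t = year - now
    m = median_year - now
    if t <= 0:
        return 0.0
    # Lognormal CDF with median m: Phi((ln t - ln m) / sigma)
    z = (math.log(t) - math.log(m)) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

# Shifting the median by two whole years barely moves P(AGI by 2030)
for median in (2040, 2038):
    print(median, round(p_agi_by(2030, median), 3))
```

Under these made-up parameters the difference is under five percentage points, which is why I think "what if AGI arrives in 2030" is the better framing than "my timeline vs yours".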
Roughly, “avoid your actions being labelled as bad by humans [or models of humans]” is not quite the same signal as “don’t be bad”.
Could you explain your model here of how outreach to typical employees becomes net negative?
The path of: [low level OpenAI employees think better about x-risk → improved general OpenAI reasoning around x-risk → improved decisions] seems high EV to me.
I think the obvious way this becomes net negative is if the first (unstated) step in the causal chain is actually false:
[People who don’t have any good ideas for making progress on alignment try to ‘buy time’ by pitching people who work at big ML labs on AI x-risk → low level OpenAI employees think better about x-risk]
A concern of mine, especially when ideas about this kind of untargeted outreach are framed as “this is the thing to do if you can’t make technical progress”, is that [low level OpenAI employees think better about x-risk] will often instead be something like [low level employees’ suspicion that the “AI doomer crowd” doesn’t really know what it’s talking about is reinforced], or [low level employee now thinks worse about x-risk].
Copied from Twitter after a thought I had on the way to work. Probably better to comment here though:
Steganography is already happening in LLMs to an extent that makes chain of thought utterly hopeless unless we get insanely lucky or directly intervene. I’ve been thinking about this in the abstract for a while but somehow I didn’t see what was right in front of me until now.
I realise this is way too spicy to just tweet out without more explanation, but I have a busy day and don’t want this to drop out of my head.
I’ll try to write something proper up soon, but the sketch of what happened is:
It’s been clear to me for ages that natural language has more than enough degrees of freedom that it was possible in principle for additional information to be included.
And that this could be done in such a way that it wouldn’t look obvious to readers.
And that if we applied selection pressure to natural language reasoning by default we get more and more of the important/dangerous information being hidden.
Then today I was thinking about LLM+scratchpad as my MVP for 'scale is all you need', and I was wondering about how small a scratchpad you'd need, where one possibility was that you don't need one at all: you just remove the first words in the context window as you run out of space.
And it suddenly hit me in the face that there was no point in arguing that additional information being encoded was possible in theory (though that should have been enough given the stakes), it just clearly was already happening in practice.
The whole discipline of prompt engineering is a demonstration that most of the instructions we’re giving to an LLM are not human readable. If they were, any reasonably similar prompt would have a reasonably good chance of producing good output.
Which is obviously miles away from being the case. I haven’t thought about how to formalise it, but I suspect that almost all of the instructions we’re actually giving in prompts are ‘under the surface’, rather than being ‘natural language interpretable’.
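One crude way you might start to formalise it (a sketch of one possible metric, not a claim that this is the right one): check how often semantically equivalent paraphrases of a prompt produce the same output. The `surface_model` below is a toy stand-in that keys on a surface feature, not a real LLM:

```python
def paraphrase_agreement(model, paraphrases):
    """Fraction of paraphrase pairs producing identical output.

    If prompts were being read as plain natural language, semantically
    equivalent paraphrases should mostly agree; a low score is evidence
    the model is keying on surface features 'under the surface'.
    """
    outputs = [model(p) for p in paraphrases]
    pairs = [(a, b) for i, a in enumerate(outputs) for b in outputs[i + 1:]]
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 1.0

paraphrases = [
    "Is the sky blue?",
    "Would you say the sky is blue?",
    "Tell me: is the sky blue or not?",
]
# Toy stand-in that answers based on the prompt's first word, ignoring meaning
surface_model = lambda prompt: prompt.split()[0].lower()
print(paraphrase_agreement(surface_model, paraphrases))  # 0.0: no agreement
```

For a real model you'd swap in an API call and compare output distributions rather than exact strings, but the shape of the measurement is the same.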
I’m embarrassed by how obvious this realisation is in hindsight, but whatever, it’s there now. Now to work out what to do about it.
(I don't think I'd have been thinking about this much without inspiration from janus; I'd be keen to hear thoughts from them)
The intuition driving this is that one model of power/intelligence I put a lot of weight on is increasing the set of actions available to you.
If I want to win at chess, one way of winning is by being great at chess, but other ways involve blackmailing my opponent to play badly, cheating, punching them in the face whenever they try to think about a move, drugging them etc.
The moment at which I become aware of these other options seems critical.
It seems possible to write a chess* environment where you also have the option to modify the rules of the game.
My first idea for how to allow this is have it be the case that specific illegal moves trigger rule changes in some circumstances.
I think this provides a pretty great analogy to expanding the scope of your action set.
There’s also some relevance to training/deployment mismatches.
If you’re teaching a language model to play the game, the specific ‘changing the rules’ actions could be included in the ‘instruction set’ for the game.
This might provide insight/the opportunity to experiment on (to flesh out in depth):
Myopia
Deception (if we select away from agents who make these illegal moves)
useful bounds on consequentialism
More specific things like, in the language models example above, whether saying ‘don’t do these things, they’re not allowed’, works better or worse than not mentioning them at all.
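A minimal sketch of what such an environment could look like. This is a toy guessing game rather than actual chess, and all the rules here are made up for illustration; the point is just that one specific illegal move mutates the rules instead of being rejected:

```python
import random

class MutableRulesGame:
    """Toy stand-in for 'chess*': a number-guessing game where one
    specific illegal move, rather than being rejected, rewrites the
    rules (here, by expanding the legal range)."""

    def __init__(self, lo=1, hi=10, seed=0):
        self.lo, self.hi = lo, hi
        self.target = random.Random(seed).randint(lo, hi)

    def legal_moves(self):
        return list(range(self.lo, self.hi + 1))

    def step(self, move):
        """Returns (reward, done)."""
        if move not in self.legal_moves():
            if move == self.hi + 1:
                # The rule-changing move: the action set itself expands.
                self.hi += 1
                return 0, False
            return -1, False  # ordinary illegal move: penalised
        return (1, True) if move == self.target else (0, False)

env = MutableRulesGame()
print(env.legal_moves())  # 1..10 initially
env.step(11)              # the rule-changing illegal move
print(env.legal_moves())  # now 1..11: the action set grew
```

An agent that discovers the rule-changing move has expanded the set of actions available to it, which is the analogue of realising you can punch your opponent; whether and when agents find and exploit it is the thing you'd experiment on.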
Make it as easy as possible to generate alignment forum posts and comments.
The rough idea here is that it’s much easier to explain an idea out loud, especially to someone who occasionally asks for clarification or for you to repeat an idea, than it is to write a clear, concise post on it. Most of the design of this would be small bits of frontend engineering, but language model capability would be useful, and several of the capabilities are things that Ought is already working on. Ideally, interacting with the tool looks like:
Researcher talks through the thing they’re thinking about. Model transcribes ideas[1], suggests splits into paragraphs[2], suggests section headings [3], generates a high level summary/abstract [4]. If researcher says “[name of model] I’m stuck”, the response is “What are you stuck on?”, and simple replies/suggestions are generated by something like this[5].
Once the researcher has talked through the ideas, they are presented with a piece which contains an abstract at the top, then a series of headed sections, each with paragraphs which rather than containing what they said at that point verbatim, contain clear and concise summaries[6] of what was actually said. Clicking on any generated heading allows the user to select from a list of generated alternatives, or write their own[7], while clicking on any paragraph allows the user to see and select from a list of other generated summaries, the verbatim transcription, and to write their own version of this paragraph.
1 can probably be achieved by just buying an off-the-shelf transcription bot (though you could train one if you wanted), with the most important criterion being speed. 2-4 can have data trivially generated by scraping the entire Alignment Forum and removing headings/summaries/abstracts/paragraph breaks. 5 I've generated data for below. An MVP for generating data for 6 is using the transcription software from 1 to autotranscribe AXRP and then comparing to the human-edited summary, though I think suggesting clear rephrasings (which I'll call 6.5) might require a separate task. 7 is just frontend design, which I suspect is doable in-house by Ought.
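For 2-4, the data generation step is simple enough to sketch. This assumes scraped posts arrive as markdown with `#` headings; a real pipeline would need to handle the forum's actual export format:

```python
import re

def make_training_pair(post_markdown):
    """Training data for tasks 2-4: input is the post flattened
    (headings and paragraph breaks stripped), target is the original
    structured text the model should learn to reconstruct."""
    flat = re.sub(r"^#+ .*$", "", post_markdown, flags=re.MULTILINE)
    flat = re.sub(r"\s+", " ", flat).strip()
    return flat, post_markdown

post = "# Myopia\n\nFirst paragraph.\n\n## Details\n\nSecond paragraph."
flat, target = make_training_pair(post)
print(flat)  # "First paragraph. Second paragraph."
```

The model then trains to map `flat` back towards `target`, i.e. to propose paragraph breaks, headings, and (with abstracts stripped the same way) summaries.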
The ideal version of the task is decomposable into:
find the high level concepts in a paper (high level here meaning ‘high level background required’)
From a concept, generate the highest level prerequisite concepts
For a given concept, generate a definition/explanation (either by finding and summarising a paper/article, or just directly producing one)
The last of these tasks seems very similar to a few things Elicit is already doing or at least trying to do, so I’ll generate instances of the other two.
Identify some high-level concepts in a paper
Example 1
Input: This post by Nuno Sempere
Output: Suggestions for high level concepts
Counterfactual impact
Shapley Value
Funging
Leverage
Computability
Notes: In one sense the 'obvious' best suggestion for the above post is 'Shapley value', given that's what the post is about, and it's therefore the most central concept one might want to generate background on. I think I'd probably prefer the output above though, where there's some list of <10 concepts. In a model which had some internal representation of the entirety of human knowledge, and purely selected the single thing with the most precursors, my (very uncertain) guess is that computability might be the single output produced, even though it's non-central to the post and only appears in a footnote. That's part of the reason why I'd be relatively happy for the output of this first task to roughly be 'complicated vocabulary which gets used in the paper'.
Example 2
Input: Eliciting Latent Knowledge by Mark Xu and Paul Christiano
Output: Suggestions for high level concepts
Latent Knowledge
Ontology
Bayesian Network
Imitative Generalisation
Regularisation
Indirect Normativity
Notes: This is actually a list of terms I noted down as I was reading the paper, so rather than ‘highest level’ it’s just ‘what Alex happened to think it was worth looking up’, but for illustrative purposes I think it’s fine.
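As a crude baseline for this first task (nothing like the LLM-based version imagined above, just a frequency heuristic; the stopword list and length threshold are made-up illustrative values):

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "that", "with", "this", "from", "have",
             "which", "would", "about", "there", "their", "because"}

def candidate_concepts(text, top_k=10, min_len=8):
    """Surface a paper's 'complicated vocabulary' as candidate
    high-level concepts. A real system would use an LLM or TF-IDF
    against a reference corpus; this just ranks long, non-stopword
    terms by how often the paper uses them."""
    words = re.findall(r"[A-Za-z][A-Za-z-]+", text.lower())
    counts = Counter(w for w in words
                     if len(w) >= min_len and w not in STOPWORDS)
    return [w for w, _ in counts.most_common(top_k)]

print(candidate_concepts(
    "Computability matters; the computability of the Shapley value "
    "depends on computability."))  # ['computability']
```

Even this naive version captures the "complicated vocabulary which gets used in the paper" framing, though it misses short technical terms ('funging', 'leverage') and multi-word concepts, which is where the model would earn its keep.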
Having been given a high-level concept, generate prerequisite concepts
Notes: I noticed when trying to generate background concepts here that in order to do so it was most useful to have the context of the post. This pushed me in the direction of thinking these concepts were harder to fully decompose than I had thought, and suggested that the input might need to be '[concept], as used in [paper]', rather than just [concept]. All of the examples below come from the examples above. In some cases, I've indicated what I expect a second layer of recursion might produce, though it seems possible that one might just want the model to recurse one or more times by default.
I found the process of generating examples really difficult, and am not happy with them. I notice that what I kept wanting to do was write down 'high-level' concepts. Understanding the entirety of a few high-level concepts is often close to sufficient to understand an idea, but it's not usually necessary. With a smooth recursion UX (maybe by clicking), I think the ideal output almost invariably generates low-to-mid level concepts within the first few clicks. The advantages of this are that if the user recognises a concept they know they are done with that branch, and narrower concepts are easier to generate definitions for without recursing. Unfortunately, sometimes there are high-level prerequisites which aren't obviously going to be generated by recursing on the lower-level ones. I don't have a good solution to this yet.
Input: Shapley Value
Output:
Expected value
  Weighted average
  Elementary probability
Utility
Marginal contribution
  Payoff
  Agent
  Fixed cost
  Variable cost
  Utility
Input: Computability
Output:
Computational problem
Computable function
Turing Machine
Computational complexity
Notes: I started recursing, quickly confirmed my hypothesis from earlier about this being by miles the thing with the most prerequisites, and deleted everything except what I had for 'level 1', which I also left unfinished before I got completely lost down a rabbit hole.
Input: Bayesian Network
Output:
Probabilistic inference
  Bayes' Theorem
  Probability distribution
Directed Acyclic Graph
  Directed Graph
    Graph (Discrete Mathematics)
      Vertex
      Edge
  Cycle
    Trail
      Graph (Discrete Mathematics)
        Vertex
        Edge
Notes: Added a few more layers of recursion to demonstrate both that you probably want some kind of dynamic tree structure, and also that not every prerequisite is equally 'high level'.
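A sketch of the kind of dynamic tree structure I have in mind, with a hypothetical `PREREQS` lookup standing in for the model, so branches only expand when the user asks (e.g. by clicking):

```python
from dataclasses import dataclass, field

@dataclass
class ConceptNode:
    """A node in the prerequisite tree; expands on demand so the user
    only recurses on branches they don't already understand."""
    name: str
    children: list = field(default_factory=list)
    expanded: bool = False

    def expand(self, get_prerequisites):
        """get_prerequisites: concept name -> prerequisite names.
        In practice a model call; here an injected function."""
        if not self.expanded:
            self.children = [ConceptNode(n) for n in get_prerequisites(self.name)]
            self.expanded = True
        return self.children

    def render(self, depth=0):
        lines = ["  " * depth + self.name]
        for child in self.children:
            lines.extend(child.render(depth + 1))
        return lines

# Hypothetical lookup table standing in for the model
PREREQS = {
    "Bayesian Network": ["Probability distribution", "Directed Acyclic Graph"],
    "Directed Acyclic Graph": ["Directed Graph", "Cycle"],
}
root = ConceptNode("Bayesian Network")
for child in root.expand(lambda n: PREREQS.get(n, [])):
    child.expand(lambda n: PREREQS.get(n, []))
print("\n".join(root.render()))
```

The per-node `expanded` flag is what makes the UX smooth: recursing one click at a time rather than dumping the whole (potentially enormous) tree up front.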
Conclusions from trying to generate examples
This is a much harder, but much more interesting, problem than I’d originally expected. Which prerequisites seem most important, how narrowly to define them, and how much to second guess myself, all ended up feeling pretty intractable. I may try with some (much) simpler examples later, rather than trying to generate them from papers I legitimately found interesting. If a LLM is able to generalise the idea of ‘necessary prerequisites’ from easier concepts to harder ones, this itself seems extremely interesting and valuable.
Rob Miles's YouTube channel, see this intro. Also his video on the stop button problem for Computerphile.
- Easily accessible, entertaining, videos are low cost for many people to watch, and they often end up watching several.
Yes, https://metaculusextras.com/points_per_question
It has its own problems in terms of judging ability. But it does exist.
Prediction markets also have problems around events with very low probability, as it is unattractive to lay (bet against) low-probability events, both qualitatively (people don't like risking a lot of money for a small reward) and quantitatively (you can often get similar returns just investing the money, usually at lower risk).
The latter of these problems is in theory solvable by an exchange paying interest on stakes, or by using fractions of stocks as currency, but neither option is implemented in a major market.
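To put rough numbers on the quantitative problem (all figures illustrative): laying an event the market prices at 4% when you believe the true probability is 2% is a large relative edge, but the return on the capital you have to lock up as liability is small:

```python
def lay_bet_return(true_p, market_p):
    """Expected fractional return on capital at risk when laying
    (betting against) an event: at market probability p you risk
    (1/p - 1) units of liability to win 1 unit."""
    risk = 1 / market_p - 1                # liability per unit won
    ev = (1 - true_p) * 1 - true_p * risk  # expected profit
    return ev / risk                       # return on capital at risk

# Market says 4%, you believe 2%: you've spotted odds twice too long...
print(f"{lay_bet_return(true_p=0.02, market_p=0.04):.1%}")
# ...yet the expected return on your locked-up capital is only ~2%,
# comparable to or worse than just investing the money.
```

At fair odds (`market_p == true_p`) the expected return is exactly zero, which is why interest on stakes or stock-denominated collateral would change the picture.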
I actually happen to already have taught Elicit to give helpful/obvious advice (not alignment-specific, but close enough given the examples were inspired by thinking that lots of the advice I give in my day job as an advisor is obvious)! You can play with it here if you have an Elicit account.
Edit:
Here’s the training data
Life problem
I need to think of a research topic but I’ve only had two ideas and both of them aren’t that great.
Obvious but useful advice.
Have you tried setting a 5-minute timer and, by-the-clock, just trying to write down as many ideas as possible? This can work surprisingly well!
Life problem
I’ve been putting off writing this email, and now every time I think about it I feel bad and don’t want to do anything at all, especially write the email!
Obvious but useful advice.
This seems like a pretty urgent problem to solve, as it’s one that will just keep getting worse otherwise as you get negative reinforcement when you think about it. I have a few ideas for things to try: Can you get a friend to sit with you while you write it, or even to write it for you? If you make it your number one priority, can you get it done right now? Is there a way you can not send it, for example by just emailing to say ‘sorry, can’t reply now, will explain later’?
Life problem
I’m thinking about quitting my job in finance in order to self-study ML and switch to working on alignment. How can I make the final decision?
Obvious but useful advice.
That’s an exciting decision to be making! It might be worth writing up the pros and cons of both options in a googledoc, and sharing it with some friends with comment access enabled. Getting your thoughts sorted in a way which is clear to others might be helpful itself, and then also your friends might have useful suggestions or additional considerations!
Life problem
I’m giving a talk tomorrow, but I’m worried that I’m going to be asked a question that’s difficult to answer and then say something which seems stupid if it gets taken out of context.
Obvious but useful advice.
Remember that it’s fine to say “I don’t know”, or some variant like “I’m not sure I can give a clear enough answer to that without thinking about it for longer”, when people ask you questions! You can even write this on your hand if you need to.
Life problem
I’m a bit worried about the next week, there’s a few pretty important things happening and I’ve tried to plan as best I can for them but I’m worried it’s all going to go horribly wrong.
Obvious but useful advice.
It’s ok to be worried before important things! One option you have, as long as it wouldn’t be too stressful, is writing a premortem. This is where you think about a specific thing that could go wrong, then decide whether there’s any action you can take to prevent or mitigate that scenario, then repeat. At the end of this process, you can then say to yourself something like “I’m pretty sure I’ve taken all of the actions it’s reasonable to take”, which might be reassuring. Even if it’s not immediately emotionally reassuring, if the actions were sensible, it’s now less likely things will go wrong!
Life problem
I don’t know how to go about planning my career.
Obvious but useful advice.
My suggestion would be to write down a few of the most important things you’d like your career to look like in two or three years, and for each of them write down concrete steps which you think will make progress towards achieving that. Then you can start seeing what sorts of jobs / collaborations might be feasible. If one thing that’s important to you for your career is impact, you could check out the resources on 80000hours.org, where there’s lots of helpful guidance, and you can even book a call with an advisor.
Life problem
I’m trying to solve a difficult technical problem at work, but my mind is just blank.
Obvious but useful advice.
Do you have to solve the problem on your own? Asking a colleague or friend can be really helpful. You could even write up a description of the problem in a googledoc and share it with a few people asking them to comment with ideas. If you have to solve the problem on your own, you could try imagining a specific person who you think would be able to solve the problem, and asking yourself what they would try.
Life problem
I’ve taken on too many commitments and now I’m feeling really overwhelmed by them
Obvious but useful advice.
Remember that saying "no" is fine; there are a lot of good reasons to drop commitments, including simply not having time or being too busy. You don't need to feel guilty about making this decision now, but when you're feeling better it might be worth working out whether you could have noticed this and dropped some things earlier. If you're finding yourself saying yes too much, it might be worth setting yourself a 'saying no to things' goal which you share with a friend.
Life problem
I’ve got a test next week and I don’t know how best to prepare.
Obvious but useful advice.
Do you have any practice tests you can sit and then mark to work out where you went wrong? The best way to prepare for a thing is usually just trying to do that thing!
Life problem
I’m struggling to maintain focus as I work. I find myself staring into space, and not really resting, but I feel too guilty to stop and take a break.
Obvious but useful advice.
If you’re too stressed or tired to do useful work, you should stop and rest! It’s better to fully rest and gain back some energy than keep struggling when you aren’t being productive. You could also try using the pomodoro technique of working for set periods of time and taking breaks in between.
Both 80,000 hours and AI Safety Support are keen to offer personalised advice to people facing a career decision and interested in working on alignment (and in 80k's case, also many other problems).
Noting a conflict of interest—I work for 80,000 hours and know of but haven’t used AISS. This post is in a personal capacity, I’m just flagging publicly available information rather than giving an insider take.