John Schulman

Karma: 461

John Schulman 6 Jun 2022 16:39 UTC
LW: 58 AF: 24
0
AF
on: AGI Ruin: A List of Lethalities
Found this to be an interesting list of challenges, but I disagree with a few points. (Not trying to be comprehensive here, just a few thoughts after the first read-through.)
- Several of the points here are premised on needing to do a pivotal act that is way out of distribution from anything the agent has been trained on. But it’s much safer to deploy AI iteratively; increasing the stakes, time horizons, and autonomy a little bit each time. With this iterative approach to deployment, you only need to generalize a little bit out of distribution. Further, you can use Agent N to help you closely supervise Agent N+1 before giving it any power.
- One claim is that Capabilities generalize further than alignment once capabilities start to generalize far. The argument is that an agent’s world model and tactics will be automatically fixed by reasoning and data, but its inner objective won’t be changed by these things. I agree with the preceding sentence, but I would draw a different (and more optimistic) conclusion from it. That it might be possible to establish an agent’s inner objective when training on easy problems, when the agent isn’t very capable, such that this objective remains stable as the agent becomes more powerful.
  Also, there’s empirical evidence that alignment generalizes surprisingly well: several thousand instruction following examples radically improve the aligned behavior on a wide distribution of language tasks (InstructGPT paper) a prompt with about 20 conversations gives much better behavior on a wide variety of conversational inputs (HHH paper). Making a contemporary language model well-behaved seems to be much easier than teaching it a new cognitive skill.
- Human raters make systematic errors—regular, compactly describable, predictable errors.… This is indeed one of the big problems of outer alignment, but there’s lots of ongoing research and promising ideas for fixing it. Namely, using models to help amplify and improve the human feedback signal. Because P!=NP it’s easier to verify proofs than to write them. Obviously alignment isn’t about writing proofs, but the general principle does apply. You can reduce “behaving well” to “answering questions truthfully” by asking questions like “did the agent follow the instructions in this episode?”, and use those to define the reward function. These questions are not formulated in formal language where verification is easy, but there’s reason to believe that verification is also easier than proof-generation for informal arguments.

John Schulman 15 May 2022 19:40 UTC
26 points
on: Is AI Progress Impossible To Predict?
Interesting analysis. Have you tried doing an analysis on quantities other than % improvement? A 10% improvement from low accuracy is different from a 10% improvement at high accuracy. So for example, you could try doing a linear regression from small_to_medium_improvement, medium_accuracy → large_accuracy and look at the variance explained.
Edit: I tried linear regression on the chinchilla MMLU data, predicting the large model accuracy from the 3 smaller models’ accuracies, and only got 8% of variance explained, vs 7% of variance explained by only looking at the second largest model’s accuracy. So that’s consistent with the OP’s claim of unpredictability.
Edit2: MMLU performance for the smaller models is about chance level, so it’s not surprising that we can’t predict much from it. (The accuracies we’re looking at for these models are noise.)

John Schulman 25 Aug 2022 20:50 UTC
LW: 25 AF: 15
22
AF
on: A Test for Language Model Consciousness
I think this test can be performed now or soon, but I’m not sure I’d update much from it. Current LMs are already pretty good at answering questions about themselves when prompted with a small amount of information about themselves. (“You are a transformer language model trained by AICo with data up to 2022/04”). We could also bake in this information through fine-tuning. They won’t be able to tell you how many layers they have without being told, but we humans can’t determine our brain architecture through introspection either.
I think the answer to “are you phenomenally conscious” will be sensitive to small differences in the training data involving similar conversations. Dialog-prompted models probably fall back on literary depictions of AI for self-oriented questions they don’t know how to answer, so the answer might depend on which sci-fi AI the model is role-playing. (It’s harder to say what determines the OOD behavior for models trained with more sophisticated methods like RLHF.)

John Schulman 26 Apr 2022 3:14 UTC
LW: 23 AF: 10
AF
on: Intuitions about solving hard problems
Weight-sharing makes deception much harder.
Could you explain or provide a reference for this?

John Schulman 30 May 2022 0:39 UTC
LW: 18 AF: 11
AF
on: Reshaping the AI Industry
IMO prosaic alignment techniques (say, around improving supervision quality through RRM & debate type methods) are highly underrated by the ML research community, even if you ignore x-risk and just optimize for near-term usefulness and intellectual interestingness. I think this is due to a combination of (1) they haven’t been marketed well to the ML community, (2) lack of benchmarks and datasets, (3) need to use human subjects in experiments, (4) it takes a decent amount of compute, which was out of reach, perhaps until recently.

John Schulman 4 Jan 2021 4:38 UTC
LW: 17 AF: 11
AF
on: Multi-dimensional rewards for AGI interpretability and control
There’s a decent amount of literature on using multiple rewards, though often it’s framed as learning about multiple goals. Here are some off the top of my head:
The Horde (classic): http://www.ifaamas.org/Proceedings/aamas2011/papers/A6_R70.pdf
Universal Value Function Approximators: http://proceedings.mlr.press/v37/schaul15.html
Learning to Act By Predicting: https://arxiv.org/abs/1611.01779
Temporal Difference Models: https://arxiv.org/abs/1802.09081
Successor Features: https://papers.nips.cc/paper/2017/hash/350db081a661525235354dd3e19b8c05-Abstract.html

Also see the discussion in Appendix D about prediction heads in OpenAI Five, used mostly for interpretability/diagnostics https://cdn.openai.com/dota-2.pdf.
What links here?
- Model-based RL, Desires, Brains, Wireheading by Steven Byrnes (14 Jul 2021 15:11 UTC; 22 points)

John Schulman 9 Jan 2021 20:08 UTC
LW: 16 AF: 12
AF
on: The Case for a Journal of AI Alignment
I think this is a good idea. If you go ahead with it, here’s a suggestion.
Reviewers often procrastinate for weeks or months. This is partly because doing a review takes an unbounded amount of time, especially for articles that are long or confusing. So instead of sending the reviewers a manuscript with a due date, book a calendar event for 2 hours with the reviewers. The reviewers join a call or group chat and read the paper and discuss it. They can also help clear each other’s confusions. They aim to complete the review by the end of the time window.

John Schulman 7 Jun 2022 0:51 UTC
LW: 13 AF: 8
−6
AF
in reply to: Eliezer Yudkowsky’s comment on: AGI Ruin: A List of Lethalities
To do what, exactly, in this nice iterated fashion, before Facebook AI Research destroys the world six months later? What is the weak pivotal act that you can perform so safely?
Do alignment & safety research, set up regulatory bodies and monitoring systems.
When the rater is flawed, cranking up the power to NP levels blows up the P part of the system.
Not sure exactly what this means. I’m claiming that you can make raters less flawed, for example, by decomposing the rating task, and providing model-generated critiques that help with their rating. Also, as models get more sample efficient, you can rely more on highly skilled and vetted raters.

John Schulman 9 Jun 2021 15:46 UTC
LW: 11 AF: 7
AF
on: “Decision Transformer” (Tool AIs are secret Agent AIs)
Basically agree—I think that a model trained by maximum likelihood on offline data is less goal-directed than one that’s trained by an iterative process where you reinforce its own samples (aka online RL), but still somewhat goal directed. It needs to simulate a goal-directed agent to do a good job at maximum likelihood. OTOH it’s mostly concerned with covering all possibilities, so the goal directed reasoning isn’t emphasized. But with multiple iterations, the model can improve quality (-> more goal directedness) at the expense of coverage/diversity.

John Schulman 8 Jun 2021 22:19 UTC
LW: 11 AF: 10
AF
on: The case for aligning narrowly superhuman models
Super clear and actionable—my new favorite post on AF.
I also agree with it, and it’s similar to what we’re doing at OpenAI (largely thanks to Paul’s influence).

John Schulman 30 Dec 2020 23:31 UTC
LW: 8 AF: 7
AF
on: Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian
The results in Neural Networks Are Fundamentally Bayesian are pretty cool—that’s clever how they were able to estimate the densities.
A couple thoughts on the limitations:
- There are various priors over functions for which we can calculate the exact posterior. (E.g., Gaussian processes.) However, doing Bayesian inference on these priors doesn’t perform as well as neural networks on most datasets. So knowing SGD is Bayesian is only interesting if we also know that the prior is interesting. I think the ideal theoretical result would be to show that SGD on neural nets is an approximation of Solomonoff Induction (or something like SI), and the approximation gets better as the NNs get bigger and deeper. But I have yet to see any theory that connects neural nets/ SGD to something like short programs.
- If SGD works because it’s Bayesian, then making it more Bayesian should make it work better. But according to https://arxiv.org/abs/2002.02405 that’s not the case. Lowering the temperature, or taking the MAP (=temperature 0) generalizes better than taking the full Bayesian posterior, as calculated by an expensive MCMC procedure.

John Schulman 23 Dec 2020 18:21 UTC
LW: 8 AF: 6
AF
on: Debate update: Obfuscated arguments problem
Wonderful writeup!
I’m sure you’ve thought about this, but I’m curious why the following approach fails. Suppose we require the debaters to each initially write up a detailed argument in judge-understandable language and read each other’s argument. Then, during the debate, each debater is allowed to quote short passages from their opponent’s writeup. Honest will be able to either find a contradiction or an unsupported statement in Dishonest’s initial writeup. If Honest quotes a passage and says its unsupported, then dishonest has to respond with the supporting sentences.

John Schulman 19 Dec 2021 0:13 UTC
LW: 7 AF: 2
AF
on: Introducing the Principles of Intelligent Behaviour in Biological and Social Systems (PIBBSS) Fellowship
I’m especially interested in the analogy between AI alignment and democracy. (I guess this goes under “Social Structures and Institutions”.) Democracy is supposed to align a superhuman entity with the will of the people, but there are a lot of failures, closely analogous to well-known AI alignment issues:
- politicians optimize for the approval of low-information voters, rather than truly optimizing the people’s wellbeing (deceptive alignment)
- politician, pacs, parties, permanent bureaucrats are agents with their own goals that don’t align with the populace (mesa optimizers)
I think it’s more likely that insights will transfer from the field of AI alignment to the field of government design than vice versa. Easier to do experiments on the AI side, and clearer thinkers.

John Schulman 7 Jun 2022 1:04 UTC
LW: 5 AF: 3
−1
AF
in reply to: Vaniver’s comment on: AGI Ruin: A List of Lethalities
Re: smooth vs bumpy capabilities, I agree that capabilities sometimes emerge abruptly and unexpectedly. Still, iterative deployment with gradually increasing stakes is much safer than deploying a model to do something totally unprecedented and high-stakes. There are multiple ways to make deployment more conservative and gradual. (E.g., incrementally increase the amount of work the AI is allowed to do without close supervision, incrementally increase the amount of KL-divergence between the new policy and a known-to-be-safe policy.)
Re: ontological collapse, there are definitely some tricky issues here, but the problem might not be so bad with the current paradigm, where you start with a pretrained model (which doesn’t really have goals and isn’t good at long-horizon control), and fine-tune it (which makes it better at goal-directed behavior). In this case, most of the concepts are learned during the pretraining phase, not the fine-tuning phase where it learns goal-directed behavior.

John Schulman 18 Apr 2022 4:59 UTC
4 points
in reply to: Martin Sustrik’s comment on: Ideal governance (for companies, countries and more)
Lee Kuan Yew wrote about how he went looking for a governance system for his party, the PAP (which now rules Singapore) after the party nearly was captured by the communists in the 50s. He looked at the Catholic Church as an inspiring example of a system that had survived for a long time, and he eventually settled on a system based on the Church’s system for electing cardinals and the Pope.

John Schulman 12 Nov 2021 10:06 UTC
LW: 4 AF: 4
AF
in reply to: Ankesh Anand’s comment on: EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised
I’m still not sure how to reconcile your results with the fact that the participants in the procgen contest ended up winning with modifications of our PPO/PPG baselines, rather than Q-learning and other value-based algorithms, whereas your paper suggests that Q-learning performs much better. The contest used 8M timesteps + 200 levels. I assume that your “QL” baseline is pretty similar to widespread DQN implementations.
https://arxiv.org/pdf/2103.15332.pdf
https://www.aicrowd.com/challenges/neurips-2020-procgen-competition/leaderboards?challenge_leaderboard_extra_id=470&challenge_round_id=662
Are there implementation level changes that dramatically improve performance of your QL implementation?
(Currently on vacation and I read your paper briefly while traveling, but I may very well have missed something.)

John Schulman 7 Nov 2021 10:56 UTC
LW: 4 AF: 3
AF
in reply to: gwern’s comment on: EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised
Performance is mostly limited here by the fact that there are 500 levels for each game (i.e., level overfitting is the problem) so it’s not that meaningful to look at sample efficiency wrt environment interactions. The results would look a lot different on the full distribution of levels. I agree with your statement directionally though.

John Schulman 25 Jun 2021 5:17 UTC
LW: 4 AF: 2
AF
in reply to: Charlie Steiner’s comment on: Frequent arguments about alignment
In my experience, you need separate teams doing safety research because specialization is useful—it’s easiest to make progress when both individuals and teams specialize a bit and develop taste and mastery of a narrow range of topics.

John Schulman 25 Jun 2021 4:36 UTC
LW: 4 AF: 4
AF
in reply to: Rohin Shah’s comment on: Frequent arguments about alignment
Yeah that’s also good point, though I don’t want to read too much into it, since it might be a historical accident.

John Schulman 31 May 2021 18:03 UTC
LW: 4 AF: 4
AF
in reply to: paulfchristiano’s comment on: Teaching ML to answer questions honestly instead of predicting human answers
D’oh, re: the optimum of the objective, I now see that the solution is nontrivial. Here’s my current understanding.
Intuitively, the MAP version of the objective says: find me a simple model theta1 such that there’s more-complex theta2 with high likelihood under p(theta2|theta1) (which corresponds to sampling theta2 near theta1 until theta2 satisfies head-agreement condition) and high data-likelihood p(data|theta2).
And this connects to the previous argument about world models and language as follows: we want theta1 to contain half a world model, and we want theta2 to contain the full world model and high data-likelihood (for one of the head) and the two heads agree. Based on Step1, the problem is still pretty underconstrained, but maybe that’s resolved in Step 2.