This is very interesting! A few thoughts/questions:
I didn’t quite follow the argument that H_{fh} beats H_{sd} on complexity. Is it that pointing to the base objective is more complicated than the logic of (simple mesa-objective) + (search logic to long-run optimize the mesa-objective)? If so, I worry a little that H_{sd} still has to learn a pointer to the base objective, if only so that it can perform well on it during training.
I actually think you can define a speed prior with a single long training episode. For an agent that plays chess, the prior can be over thinking time per move; for an agent that runs in a simulated environment, it could be ‘thinking time per unit simulation time’; for GPT, ‘thinking time per predicted word’; and so on.
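To make the per-move version concrete, here is a minimal sketch (all names hypothetical, not from the post) of how a speed prior could be applied within one long episode: each decision accrues a penalty proportional to the compute spent on it, so slower thinking is disfavored without needing many short episodes.

```python
# Minimal sketch of a per-decision speed prior within a single long episode.
# All names (speed_penalized_return, think_steps, etc.) are hypothetical.

def speed_penalized_return(episode, penalty_per_step=0.01):
    """Total reward minus a penalty on thinking time per decision.

    `episode` is a list of (reward, think_steps) pairs, one per move:
    the reward earned at that move and the compute spent deciding it.
    """
    total = 0.0
    for reward, think_steps in episode:
        total += reward - penalty_per_step * think_steps
    return total

# Example: three moves with rewards 1.0, 0.0, 1.0 and thinking
# budgets of 10, 50, and 20 steps.
episode = [(1.0, 10), (0.0, 50), (1.0, 20)]
print(speed_penalized_return(episode))  # 2.0 - 0.01 * 80 = 1.2
```

The same shape works for the other cases by swapping the denominator of the penalty: simulation time per unit simulated time, or thinking time per predicted word.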
I think I may be confused about the argument being made in the ‘Deceptively Aligned Models’ section, and am restating my understanding here to see if you agree. [And if not, clarification on what I’ve got wrong would be very helpful!]
I think I understand the previous two sections:
- Models that converge to internally aligned states do so very slowly, because as they become more internally aligned it gets less and less likely that they encounter examples which differentiate between the proxy and base objectives.
- Models that converge to corrigibly aligned states do so very slowly, because as their pointers to the base objective become better it gets less and less likely that they encounter examples which can shift the pointer towards the base objective.
My best attempt to restate the argument in the deceptively aligned models section is something like:
1. Models that converge to deceptively aligned states at some point notice they’re in training, and this happens before, e.g., a model can converge to a corrigibly aligned state.
2. At that point, SGD pushes them to better model the training process, because that helps them perform well on the base objective during training. So they learn deception.
3. Models that learn deception learn the pointer to the base objective at runtime rather than via SGD. To the extent that the models are able to build powerful optimization processes, this might be more efficient than SGD.
4. Assuming the above, models that learn deception manage to learn the pointer to the base objective faster than models that converge to corrigible states do, and faster than internally aligned models converge on a model of the base objective proper.
5. As a result, starting from a random initialization, the first state you hit on is likely to be a deceptive one.
Is that right?
If it is, one possible issue is that a lot of work is being done by two pieces:
(1) It is easier to learn a pointer to the base objective at runtime than during training.
(2) Deceptive alignment, unlike internal or corrigible alignment, allows learning a pointer at runtime, so (1) favors deception.
I agree that (1) is likely, but (2) is less clear. I think a model could have a proxy objective of “learn the base objective at runtime and follow that”, and so be corrigibly aligned while still getting the benefits of runtime learning. A counter-counterpoint is that this is an unlikely proxy objective to have learned early in training, and I’m not sure how to think about that yet...