PhD student in AI safety at CHAI (UC Berkeley)

# Erik Jenner

Thanks for the pointers, I hadn’t seen the Abstracting Abstract Machines paper before.

If you mean you specifically don’t get the goal of minimal abstractions under this partial order: I’m much less convinced they’re useful for anything than I used to be, currently not sure.

If you mean you don’t get the goal of the entire agenda, as described in the earlier agenda post: I’m currently mostly thinking about mechanistic anomaly detection. Maybe it’s not legible right now how that would work using abstractions, I’ll write up more on that once I have some experimental results (or maybe earlier). (But happy to answer specific questions in the meantime.)

In a previous post, I described my current alignment research agenda, formalizing abstractions of computations. One among several open questions I listed was whether unique minimal abstractions always exist. It turns out that (within the context of my current framework), the answer is yes.

I had a complete post on this written up (which I’ve copied below), but it turns out that the result is completely trivial if we make a fairly harmless assumption: the information we want the abstraction to contain is only a function of the *output* of the computation, not of memory states. I.e. we only intrinsically care about the output.

Say we are looking for the minimal abstraction that lets us compute $f(C(x))$, where $C$ is the computation we want to abstract, $x$ the input, and $f$ an arbitrary function that describes which aspects of the output our abstractions should predict. Note that we can construct a map $F$ that takes any intermediate memory state and finishes the computation. By composing with $f$, we get a map $f \circ F$ that computes $f(C(x))$ from any memory state induced by $x$. This will be our abstraction. We can get a commutative diagram by simply using the identity as the abstract computational step. This abstraction is also minimal by construction: any other abstraction from which we can determine $f(C(x))$ must (tautologically) also determine our abstraction.
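As a rough illustration of this trivial construction, here is a minimal Python sketch. The helper names (`finish`, `trivial_abstraction`) and the toy "sum the integers 1..n" computation are assumptions made up for this example, not part of the framework itself. The abstraction maps a memory state to the relevant aspect of the final output, and the abstract step is the identity.

```python
def finish(step, terminated, read_output, m):
    """Run the computation to completion from an arbitrary intermediate memory state."""
    while not terminated(m):
        m = step(m)
    return read_output(m)


def trivial_abstraction(step, terminated, read_output, f):
    """Abstraction that maps a memory state to f(final output)."""
    return lambda m: f(finish(step, terminated, read_output, m))


# Toy computation (an assumption for illustration): sum the integers 1..n,
# with memory state (counter, accumulator).
step = lambda m: (m[0] - 1, m[1] + m[0])
terminated = lambda m: m[0] == 0
read_output = lambda m: m[1]
is_even = lambda y: y % 2 == 0  # the aspect of the output we care about

abstraction = trivial_abstraction(step, terminated, read_output, is_even)

# The abstract step is the identity: along a run, the abstract state never changes.
trajectory = [(4, 0), (3, 4), (2, 7), (1, 9)]
commutes = all(abstraction(step(m)) == abstraction(m) for m in trajectory)
```

Note that `abstraction` does all the work of the original computation, which is exactly why it is not "mechanistic".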

This also shows that minimality in this particular sense isn’t enough for a good abstraction: the abstraction we constructed here is not at all “mechanistic”, it just finishes the entire computation *inside the abstraction*. So I think what we need to do instead is demand that the abstraction mapping (from full memory states to abstract memory states) is simple (in terms of descriptive or computational complexity).

Below is the draft of a much longer post I was going to write before noticing this trivial proof. It works even if the information we want to represent depends directly on the memory state instead of just the output. But to be clear, I don’t think that generalization is all that important, and I probably wouldn’t have bothered writing it down if I had noticed the trivial case first.

This post is fully self-contained in terms of the math, but doesn’t discuss examples or motivation much; see the agenda intro post for that. I expect this post will only be useful to readers who are very interested in my agenda or working on closely connected topics.

# Setup

## Computations

We define a *computation* exactly as in the earlier post: it consists of a set $M$ of *memory states*, a *transition function* $\tau: M \to M$, a *termination function* $\omega: M \to \{\text{True}, \text{False}\}$, a set $X$ of *input values* with an *input function* $\iota: X \to M$, and a set $Y$ of *output values* with an *output function* $o: M \to Y$.

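While we won’t even need that part in this post, let’s talk about how to “execute” computations for completeness’ sake. A computation induces a function $X \to Y$ from input values to output values as follows:

Given an input $x \in X$, apply the input function $\iota$ to obtain the first memory state $m_0 = \iota(x)$.

While the termination function gives $\omega(m_t) = \text{False}$, i.e. the computation hasn’t terminated yet, execute a computational step, i.e. apply the transition function and let $m_{t+1} = \tau(m_t)$.

Output $o(m_t)$, the output function applied to the current state, once $\omega(m_t)$ is True. For simplicity, we assume that $\omega(m_t)$ is always true for some finite timestep $t$, no matter what the input $x$ is, i.e. the computation always terminates. (Without this assumption, we would get a *partial* function $X \to Y$.)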
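The execution procedure above can be sketched in a few lines of Python. The class and field names (`Computation`, `step`, `terminated`, `read_input`, `read_output`) are hypothetical names chosen for this example, and the `max_steps` guard is an added safety net rather than part of the definition.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

M = TypeVar("M")  # memory states
X = TypeVar("X")  # input values
Y = TypeVar("Y")  # output values


@dataclass(frozen=True)
class Computation(Generic[M, X, Y]):
    """Hypothetical container for the ingredients of a computation."""

    step: Callable[[M], M]           # transition function
    terminated: Callable[[M], bool]  # termination function
    read_input: Callable[[X], M]     # input function
    read_output: Callable[[M], Y]    # output function

    def run(self, x: X, max_steps: int = 10_000) -> Y:
        """Execute the computation on input x and return its output."""
        m = self.read_input(x)
        for _ in range(max_steps):
            if self.terminated(m):
                return self.read_output(m)
            m = self.step(m)
        # Guard added for this sketch; the text simply assumes termination.
        raise RuntimeError("no termination within max_steps")


# Toy example: sum the integers 1..n with memory state (counter, accumulator).
sum_to_n = Computation(
    step=lambda m: (m[0] - 1, m[1] + m[0]),
    terminated=lambda m: m[0] == 0,
    read_input=lambda n: (n, 0),
    read_output=lambda m: m[1],
)
```

Calling `sum_to_n.run(n)` then plays the role of the function induced by the computation.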

## Abstractions

An *abstraction* of a computation is an equivalence relation $\sim$ on the set of memory states $M$. The intended interpretation is that this abstraction collapses all equivalent memory states into one “abstract memory state”. Different equivalence relations correspond to throwing away different parts of the information in the memory state: we retain the information about which equivalence class the memory state is in, but “forget” the specific memory state within this equivalence class.

*As an aside: In the original post, I instead defined an abstraction as a function $\mathcal{A}: M \to A$ for some set $A$. These are two essentially equivalent perspectives and I make regular use of both. For a function $\mathcal{A}$, the associated equivalence relation is $m \sim_{\mathcal{A}} m'$ if and only if $\mathcal{A}(m) = \mathcal{A}(m')$. Conversely, an equivalence relation $\sim$ can be interpreted as the quotient function $M \to M/{\sim}$, where $M/{\sim}$ is the set of equivalence classes under $\sim$. I often find the function view from the original post intuitive, but its drawback is that many different functions are essentially “the same abstraction” in the sense that they lead to the same equivalence relation. This makes the equivalence relation definition better for most formal statements like in this post, because there is a well-defined set of all abstractions of a computation. (In contrast, there is no set of all functions with domain $M$.)*

An important construction for later is that we can “pull back” an abstraction $\sim$ along the transition function $\tau: M \to M$: define the equivalence relation $\sim^\tau$ by letting $m \sim^\tau m'$ if and only if $\tau(m) \sim \tau(m')$. Intuitively, $\sim^\tau$ is the abstraction of the next timestep.

The second ingredient we need is a partial ordering on abstractions: we say that $\sim_1 \leq \sim_2$ if and only if $m \sim_2 m'$ implies $m \sim_1 m'$ for all memory states $m, m'$. Intuitively, this means $\sim_1$ is at least as “coarse-grained” as $\sim_2$, i.e. $\sim_1$ contains at most as much information as $\sim_2$. This partial ordering is exactly the transpose of the ordering by refinement of the equivalence relations (or partitions) on the set of memory states. It is well-known that this partially ordered set is a complete lattice. In our language, this means that any set of abstractions has a (unique) supremum and an infimum, i.e. a least upper bound and a greatest lower bound.
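For a finite set of memory states, these definitions are easy to play with directly. The following Python sketch represents an abstraction as a partition (a set of equivalence classes); the function names `partition_from`, `pull_back`, and `leq`, as well as the toy "counter modulo 8" transition, are assumptions made for illustration only.

```python
from typing import Callable, FrozenSet, Hashable

Partition = FrozenSet[FrozenSet[Hashable]]


def partition_from(states, label: Callable) -> Partition:
    """Equivalence relation induced by a function: m ~ m' iff label(m) == label(m')."""
    classes = {}
    for m in states:
        classes.setdefault(label(m), set()).add(m)
    return frozenset(frozenset(c) for c in classes.values())


def pull_back(states, partition: Partition, step: Callable) -> Partition:
    """Pullback along the transition: m ~' m' iff step(m) ~ step(m')."""
    class_of = {m: c for c in partition for m in c}
    return partition_from(states, lambda m: class_of[step(m)])


def leq(coarse: Partition, fine: Partition) -> bool:
    """coarse <= fine: every class of `fine` lies inside a single class of `coarse`."""
    return all(any(f <= c for c in coarse) for f in fine)


states = range(8)
step = lambda m: (m + 1) % 8  # toy transition function

parity = partition_from(states, lambda m: m % 2)
mod4 = partition_from(states, lambda m: m % 4)

# Different functions can induce the same abstraction.
same_abstraction = parity == partition_from(states, lambda m: m % 2 == 0)
# Parity contains at most as much information as the residue mod 4.
parity_is_coarser = leq(parity, mod4) and not leq(mod4, parity)
# Knowing the parity now is the same as knowing the parity at the next step.
next_step_parity = pull_back(states, parity, step)
```

In this representation, `leq` follows the convention of the text: the smaller abstraction is the coarser one.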

One potential source of confusion is that I’ve defined the $\leq$ relation exactly the other way around compared to the usual refinement of partitions. The reason is that I want a “minimal abstraction” to be one that contains the *least* amount of information rather than calling this a “maximal abstraction”. I hope this is the less confusing of two bad choices.

We say that an abstraction $\sim$ is *complete* if $\sim^\tau \leq \sim$, where $\sim^\tau$ is the pullback of $\sim$ along the transition function $\tau$. Intuitively, this means that the abstraction contains all the information necessary to compute the “next abstract memory state”. In other words, given only the abstraction of a state $m$, we can compute the abstraction of the next state $\tau(m)$. This corresponds to the commutative diagram/consistency condition from my earlier post, just phrased in the equivalence relation view.

# Minimal complete abstractions

As a quick recap from the earlier agenda post, “good abstractions” should be complete. But there are two trivial complete abstractions: on the one hand, we can just not abstract at all, using equality as the equivalence relation, i.e. retain all information. On the other hand, we can throw away all information by letting $m \sim m'$ for any memory states $m, m'$.

To avoid the abstraction that keeps around all information, we can demand that we want an abstraction that is *minimal* according to the partial order defined in the previous section. To avoid the abstraction that throws away all information, let us assume there is some information we intrinsically care about. We can represent this information as an abstraction $\sim_0$ itself. We then want our abstraction $\sim$ to contain at least the information contained in $\sim_0$, i.e. $\sim \geq \sim_0$.

As a prototypical example, perhaps there is some aspect of the output we care about (e.g. we want to be able to predict the most significant digit). Think of this as an equivalence relation on the set of outputs. Then we can “pull back” this abstraction along the output function to get $\sim_0$.

So given these desiderata, we want the minimal abstraction among all *complete* abstractions $\sim$ with $\sim \geq \sim_0$, where $\sim_0$ is the information we intrinsically care about. However, it’s not immediately obvious that such a minimal complete abstraction exists. We can take the infimum of all complete abstractions $\sim$ with $\sim \geq \sim_0$ of course (since abstractions form a complete lattice). But it’s not clear that this infimum is itself also complete!

Fortunately, it turns out that the infimum is indeed complete, i.e. minimal complete abstractions exist:

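**Theorem:** Let $\sim_0$ be any abstraction (i.e. equivalence relation) on the set of memory states $M$ and let $\mathcal{C}$ be the set of complete abstractions at least as informative as $\sim_0$. Then $\mathcal{C}$ is a complete lattice under $\leq$, in particular it has a least element.

The proof is very easy if we use the Knaster-Tarski fixed-point theorem. First, we define the *completion operator* on abstractions: $\operatorname{compl}(\sim) := \sup \{\sim, \sim^\tau\}$, where $\sim^\tau$ is the pullback of $\sim$ along the transition function. Intuitively, $\operatorname{compl}(\sim)$ is the minimal abstraction that contains both the information in $\sim$, but also the information in the *next* abstract state under $\sim$.

Note that $\sim$ is complete, i.e. $\sim^\tau \leq \sim$, if and only if $\operatorname{compl}(\sim) = \sim$, i.e. if $\sim$ is a fixed point of the completion operator. Furthermore, note that the completion operator is monotonic: if $\sim_1 \leq \sim_2$, then $\sim_1^\tau \leq \sim_2^\tau$, and so $\operatorname{compl}(\sim_1) \leq \operatorname{compl}(\sim_2)$.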

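Now define $L := \{\sim \mid \sim \geq \sim_0\}$, i.e. the set of all abstractions at least as informative as $\sim_0$. Note that $L$ is also a complete lattice, just like the set of all abstractions: for any non-empty subset $S \subseteq L$, $\sup S \geq \sim_0$, so $\sup S \in L$. The supremum of the empty set in $L$ is simply $\sim_0$. Similarly, $\inf S \in L$ since $\sim_0$ must be a lower bound on $S$. Finally, observe that we can restrict the completion operator to a function $L \to L$: if $\sim \in L$, then $\operatorname{compl}(\sim) \geq \sim \geq \sim_0$, so $\operatorname{compl}(\sim) \in L$.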

That means we can apply the Knaster-Tarski theorem to the completion operator restricted to $L$, the set of abstractions at least as informative as $\sim_0$. The consequence is that the set of fixed points on $L$ is itself a complete lattice. But this set of fixed points is exactly $\mathcal{C}$, the set of complete abstractions at least as informative as $\sim_0$! So $\mathcal{C}$ is a complete lattice as claimed, and in particular the least element exists. This is exactly the minimal complete abstraction that’s at least as informative as $\sim_0$.
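For a *finite* set of memory states, the least fixed point can also be computed directly, by applying the completion operator repeatedly to the information we care about until nothing changes. This works because the operator is monotonic and the lattice is finite, so the increasing chain has to stabilize at the least fixed point. The Python sketch below is only an illustration under that finiteness assumption; the function names and the toy "counter modulo 12" computation are made up for this example.

```python
from typing import Callable, Dict, Hashable, List


def _renumber(label: Dict[Hashable, Hashable]) -> Dict[Hashable, int]:
    """Replace arbitrary class labels by small integers (same equivalence relation)."""
    ids: Dict[Hashable, int] = {}
    return {m: ids.setdefault(v, len(ids)) for m, v in label.items()}


def minimal_complete_abstraction(
    states: List[Hashable],
    step: Callable[[Hashable], Hashable],
    info: Callable[[Hashable], Hashable],
) -> Dict[Hashable, int]:
    """Iterate the completion operator, starting from the abstraction induced by `info`.

    Returns a map from each state to the index of its equivalence class.
    Assumes `states` is finite and closed under `step`.
    """
    # Start with the coarsest abstraction that still contains `info`.
    label = _renumber({m: info(m) for m in states})
    while True:
        # Supremum of the current abstraction and its pullback along `step`:
        # two states stay equivalent only if they and their successors agree.
        refined = _renumber({m: (label[m], label[step(m)]) for m in states})
        # `refined` always refines `label`, so equal class counts mean equality.
        if len(set(refined.values())) == len(set(label.values())):
            return label  # fixed point: the abstraction is complete
        label = refined


# Toy example: a counter modulo 12; we only care whether it is divisible by 3.
states = list(range(12))
step = lambda m: (m + 1) % 12
classes = minimal_complete_abstraction(states, step, lambda m: m % 3 == 0)
```

In this toy case the result groups states by their residue modulo 3, which is the least information needed to keep predicting divisibility by 3 at every future step.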

# Takeaways

What we’ve shown is the following: if we want an abstraction of a computation which

is complete, i.e. gives us the commutative consistency diagram,

contains (at least) some specific piece of information,

and is minimal given these constraints,

there always exists a unique such abstraction.

In the setting where we want *exact* completeness/consistency, this is quite a nice result! One reason I think it ultimately isn’t that important is that we’ll usually be happy with approximate consistency. In that case, the conditions above are better thought of as competing objectives than as hard constraints. Still, it’s nice to know that things work out neatly in the exact case.

It should be possible to do a lot of this post *much* more generally than for abstractions of computations, for example in the (co)algebra framework for abstractions that I recently wrote about. In brief, we can define equivalence relations on objects in an arbitrary category as certain equivalence classes of morphisms. If the category is “nice enough” (specifically, if there’s a set of all equivalence relations and if arbitrary products exist), then we get a complete lattice again. I currently don’t have any use for this more general version though.

Thanks for the responses! I think we qualitatively agree on a lot, just put emphasis on different things or land in different places on various axes. Responses to some of your points below:

The local/causal structure of our universe gives a very strong preferred way to “slice it up”; I expect that’s plenty sufficient for convergence of abstractions. [...]

Let me try to put the argument into my own words: because of locality, any “reasonable” variable transformation can in some sense be split into “local transformations”, each of which involves only a few variables. These local transformations aren’t a problem because if we, say, resample $n$ variables at a time, then transforming $m < n$ variables doesn’t affect redundant information.

I’m tentatively skeptical that we can split transformations up into these local components. E.g. to me it seems that describing some large number of particles by their center of mass and the distance vectors from the center of mass is a very reasonable description. But it sounds like you have a notion of “reasonable” in mind that’s more specific than the set of all descriptions physicists might want to use.

I also don’t see yet how exactly to make this work given local transformations. E.g. I think my version above doesn’t quite work because if you’re resampling a finite number $n$ of variables at a time, then I do think transforms involving fewer than $n$ variables can sometimes affect redundant information. I know you’ve talked before about resampling *any* finite number of variables in the context of a system with infinitely many variables, but I think we’ll want a theory that can also handle finite systems. Another reason this seems tricky: if you compose lots of local transformations, for overlapping local neighborhoods, you get a transformation involving lots of variables. I don’t currently see how to avoid that.

I’d also offer this as one defense of my relatively low level of formality to date: finite approximations are clearly the right way to go, and I didn’t yet know the best way to handle finite approximations. I gave proof sketches at roughly the level of precision which I expected to generalize to the eventual “right” formalizations. (The more general principle here is to only add formality when it’s the *right* formality, and not to prematurely add ad-hoc formulations just for the sake of making things more formal. If we don’t yet know the full right formality, then we should sketch at the level we think we do know.)

Oh, I did not realize from your posts that this is how you were thinking about the results. I’m very sympathetic to the point that formalizing things that are ultimately the wrong setting doesn’t help much (e.g. in our appendix, we recommend people focus on the conceptual open problems like finite regimes or encodings, rather than more formalization). We may disagree about how much progress the results to date represent regarding finite approximations. I’d say they contain conceptual ideas that may be important in a finite setting, but I also expect most of the work will lie in turning those ideas into non-trivial statements about finite settings. In contrast, most of your writing suggests to me that a large part of the theoretical work has been done (not sure to what extent this is a disagreement about the state of the theory or about communication).

Existing work has managed to go from pseudocode/circuits to interpretation of inputs mainly by looking at cases where the circuits in question are very small and simple—e.g. edge detectors in Olah’s early work, or the sinusoidal elements in Neel’s work on modular addition. But this falls apart quickly as the circuits get bigger—e.g. later layers in vision nets, once we get past early things like edge and texture detectors.

I totally agree with this FWIW, though we might disagree on some aspects of how to scale this to more realistic cases. I’m also very unsure whether I get how you concretely want to use a theory of abstractions for interpretability. My best story is something like: look for good abstractions in the model and then for each one, figure out what abstraction this is by looking at training examples that trigger the abstraction. If NAH is true, you can correctly figure out which abstraction you’re dealing with from just a few examples. But the important bit is that you start with a part of the model that’s actually a natural abstraction, which is why this approach doesn’t work if you just look at examples that make a neuron fire, or similar ad-hoc ideas.

More generally, if you’re used to academia, then bear in mind the incentives of academia push towards making one’s work *defensible* to a much greater degree than is probably optimal for truth-seeking.

I agree with this. I’ve done stuff in some of my past papers that was just for defensibility and didn’t make sense from a truth-seeking perspective. I absolutely think many people in academia would profit from updating in the direction you describe, if their goal is truth-seeking (which it should be if they want to do helpful alignment research!)

On the other hand, I’d guess the optimal amount of precision (for truth-seeking) is higher in my view than it is in yours. One crux might be that you seem to have a tighter association between precision and tackling the wrong questions than I do. I agree that obsessing too much about defensibility and precision will lead you to tackle the wrong questions, but I think this is feasible to avoid. (Though as I said, I think many people, especially in academia, *don’t* successfully avoid this problem! Maybe the best quick fix for them would be to worry less about precision, but I’m not sure how much that would help.) And I think there’s also an important failure mode where people constantly think about important problems but never get any concrete results that can actually be used for anything.

It also seems likely that different levels of precision are genuinely right for different people (e.g. I’m unsurprisingly much more confident about what the right level of precision is for me than about what it is for you). To be blunt, I would still guess the style of arguments and definitions in your posts only work well for very few people in the long run, but of course I’m aware you have lots of details in your head that aren’t in your posts, and I’m also very much in favor of people just listening to their own research taste.

both my current work and most of my work to date is aimed more at truth-seeking than defensibility. I don’t think I currently have all the right pieces, and I’m trying to get the right pieces quickly.

Yeah, to be clear I think this is the right call, I just think that more precision would be better for quickly arriving at useful true results (with the caveats above about different styles being good for different people, and the danger of overshooting).

Being both precise and readable at the same time is hard, man.

Yeah, definitely. And I think different trade-offs between precision and readability are genuinely best for different readers, which doesn’t make it easier. (I think this is a good argument for separate distiller roles: if researchers have different styles, and can write best to readers with a similar style of thinking, then plausibly any piece of research should have a distillation written by someone with a different style, even if the original was already well written for a certain audience. It’s probably not that extreme, I think often it’s at least possible to find a good trade-off that works for most people, though hard).

It explained both my and its moves every time; those explanations got wrong earlier.

Note that at least for ChatGPT (3.5), telling it to not explain anything and only output moves apparently helps. (It can play legal moves for longer that way). So that might be worth trying if you want to get better performance. Of course, giving it the board state after each move could also help but might require trying a couple different formats.

# [Appendix] Natural Abstractions: Key Claims, Theorems, and Critiques

# Natural Abstractions: Key claims, Theorems, and Critiques

Outer alignment seems to be defined as models/AI systems that are optimizing something that is very close or identical to what they were programmed to do or the humans desire. Inner alignment seems to relate more to the goals/aims of a delegated optimizer that an AI system spawns in order to solve the problem it is tasked with.

This is not how I (or most other alignment researchers, I think) usually think about these terms. Outer alignment means that your loss function describes something that’s very close to what you actually want. In other words, it means SGD is optimizing for the right thing. Inner alignment means that the model you train is optimizing for its loss function. If your AI system creates new AI systems itself, maybe you could call that inner alignment too, but it’s not the prototypical example people mean.

In particular, “optimizing something that is very close [...] to what they were programmed to do” is *inner* alignment the way I would use those terms. An outer alignment failure would be if you “program” the system to do something that’s not what you actually want (though “programming” is a misleading word for ML systems).

[Ontology identification] seems __neither necessary nor sufficient [… for] safe future AI systems__

FWIW, I agree with this statement (even though I wrote the ontology identification post you link and am generally a pretty big fan of ontology identification and related framings). Few things are necessary or sufficient for safe AGI. In my mind, the question is whether something is a useful research direction.

It does not seem necessary for AI systems to do this in order to be safe. This is trivial as we already have very safe but complex ML models that work in very abstract spaces.

This seems to apply to any technique we aren’t already using, so it feels like a fully general argument against the need for new safety techniques. (Maybe you’re only using this to argue that ontology identification isn’t strictly necessary, in which case I and probably most others agree, but as mentioned above that doesn’t seem like the key question.)

More importantly, I don’t see how “current AI systems aren’t dangerous even though we don’t understand their thoughts” implies “future more powerful AGIs won’t be dangerous”. IMO the reason current LLMs aren’t dangerous is clearly their lack of capabilities, not their amazing alignment.

Note that according to Bucky’s comment, we still get good play up to move 39 (with ChatGPT instead of Sydney), where “good” doesn’t just mean legal but actual strong moves. So I wouldn’t be at all surprised if it’s still mostly reasonable at move 50 (though I do generally expect it to get somewhat worse the longer the game is). Might get around to testing at least this one point later if no one else does.

Thanks for checking in more detail! Yeah, I didn’t try to change the prompt much to get better behavior out of ChatGPT. In hindsight, that’s a pretty unfair comparison given that the Stockfish prompt here might have been tuned to Sydney quite a bit by Zack. Will add a note to the post.

# Sydney can play chess and kind of keep track of the board state

Key hypothesis: neural nets or brains are typically initialized in a “scarce channels” regime. A randomly initialized neural net generally throws out approximately-all information by default (at initialization), as opposed to passing lots of information around to lots of parts of the net.

Just to make sure I’m understanding this correctly, you’re claiming that the mutual information between the input and the output of a randomly initialized network is low, where we have some input distribution and treat the network weights as fixed? (You also seem to make similar claims about things inside the network, but I’ll just focus on input-output mutual information)

I think we can construct toy examples where that’s false. E.g. use a feedforward MLP with any bijective activation function and where input, output, and all hidden layers have the same dimensionality (so the linear transforms are all described by random square matrices). Since a random square matrix will be invertible with probability one, this entire network is invertible at random initialization, so the mutual information between input and output is maximal (the entropy of the input).
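Here is a small sanity check of this toy example in plain Python. The specific choices are assumptions for illustration: a leaky ReLU as the bijective activation, width 4, depth 3, Gaussian weights, and a fixed seed. The point is only that the randomly initialized network can be inverted layer by layer, so no information about the input is lost.

```python
import random

random.seed(0)
DIM, DEPTH, SLOPE = 4, 3, 0.1  # illustrative choices, not from the discussion


def leaky_relu(v):
    return [x if x >= 0 else SLOPE * x for x in v]


def leaky_relu_inverse(v):
    # Leaky ReLU is bijective on the reals, so this inverse is exact.
    return [y if y >= 0 else y / SLOPE for y in v]


def matvec(w, v):
    return [sum(a * x for a, x in zip(row, v)) for row in w]


def solve(w, b):
    """Solve w @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    a = [row[:] + [rhs] for row, rhs in zip(w, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


# Random square weight matrices, one per layer.
weights = [[[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]
           for _ in range(DEPTH)]


def forward(x):
    for w in weights:
        x = leaky_relu(matvec(w, x))
    return x


def inverse(y):
    # Undo the layers in reverse order: invert the activation, then the matrix.
    for w in reversed(weights):
        y = solve(w, leaky_relu_inverse(y))
    return y


x = [random.gauss(0, 1) for _ in range(DIM)]
recovered = inverse(forward(x))
max_error = max(abs(a - b) for a, b in zip(x, recovered))
```

Up to floating-point error, `recovered` should match `x`, which is the sense in which the input-output mutual information is maximal here.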

These are unrealistic assumptions (though I think the argument should still work as long as the hidden layers aren’t lower-dimensional than the input). In practice, the hidden dimensionality will often be lower than that of the input of course, but then it seems to me like that’s the key, not the random initialization. (Mutual information would still be maximal *for the architecture*, I think). Maybe using ReLUs instead of bijective activations messes all of this up? Would be really weird though if ReLUs vs Tanh were the key as to whether network internals mirror the external abstractions.

My take on what’s going on here is that at random initialization, the neural network doesn’t pass around information in an *easily usable* way. I’m just arguing that mutual information doesn’t really capture this and we need some other formalization (maybe along the lines of this: https://arxiv.org/abs/2002.10689 ). I don’t have a strong opinion how much that changes the picture, but I’m at least hesitant to trust arguments based on mutual information if we ultimately want some other information measure we haven’t defined yet.

[Erik] uses the abstraction diagram from the machine’s state into , which I am thinking of a general human interpretation language. He also is trying to discover jointly with , if I understand correctly.

Yep, I want to jointly find both maps. I don’t necessarily think of as a human-interpretable format—that’s one potential direction, but I’m also very interested in applications for non-interpretable , e.g. to mechanistic anomaly detection.

But then I can achieve zero loss by learning an that maps a state in to the uniform probability distribution over . [...] So why don’t we try a different method to prevent steganography—reversing the directions of and (and keeping the probabilistic modification)

FWIW, the way I hope to deal with the issue of the trivial constant abstraction you mention here is to have some piece of information that we intrinsically care about, and then enforce that this information isn’t thrown away by the abstraction. For example, perhaps you want the abstraction to at least correctly predict that the model gets low loss on a certain input distribution. There’s probably a difference in philosophy at work here: I *want* the abstraction to be as small as possible while still being useful, whereas you seem to aim at translating to or from the entire human ontology, so throwing away information is undesirable.

I agree this is an exciting idea, but I don’t think it clearly “just works”, and since you asked for ways it could fail, here are some quick thoughts:

If I understand correctly, we’d need a model that we’re confident is a mesa-optimizer (and perhaps even deceptive—mesa-optimizers per se might be ok/desirable), but still not capable enough to be dangerous. This might be a difficult target to hit, especially if there are “thresholds” where slight changes have big effects on how dangerous a model is.

If there’s a very strong inductive bias towards deception, you might have to sample an astronomical number of initializations to get a non-deceptive model. Maybe you can solve the computational problem, but it seems harder to avoid the problem that you need to optimize/select against your deception-detector. The stronger the inductive bias for deception, the more robustly the method needs to distinguish basins.

Related to the previous point, it seems plausible to me that whether you get a mesa-optimizer or not has very little to do with the initialization. It might depend almost entirely on other aspects of the training setup.

It seems unclear whether we can find fingerprinting methods that can distinguish deception from non-deception, or mesa-optimization from non-mesa-optimization, but which don’t also distinguish a ton of other things. The paragraph about how there are hopefully not that many basins makes an argument for why we might expect this to be possible, but I still think this is a big source of risk/uncertainty. For example, the fingerprinting that’s actually done in this post distinguishes different base models based on plausibly meaningless differences in initialization, as opposed to deep mechanistic differences. So our fingerprinting technique would need to be much less sensitive, I think?

ETA: I do want to highlight that this is still one of the most promising ideas I’ve heard recently and I really look forward to hopefully reading a full post on it!

The whole point of the post is to be a psychological framework for *actually doing useful work that increases humanity’s long odds of survival.*

I can’t decide whether “long odds” is a simple typo or a brilliant pun.

Thanks for your thoughts!

Overall, I’m skeptical of the existence of magic bullets when it comes to abstraction—by which I mean that I expect most problems to have multiple solutions and for those solutions to generalize only a little, not that I expect problems to have zero solutions.

Not sure I follow. Do you mean that you expect there to not be a single nice *framework* for abstractions? Or that in most situations, there won’t be one clearly best abstraction?

(FWIW, I’m quite agnostic but hopeful on the existence of a good general framework. And I think in many cases, there are going to be lots of very reasonable abstractions, and which one is “best” depends a lot on what you want to use it for.)

Sure, commuting diagrams / non-leaky abstractions have nice properties and are unique points in the space of abstractions, but they don’t count as solutions to most problems of interest. Calling commuting diagrams “abstraction” and everything else “approximate abstraction” is I think the wrong move—abstractions are almost as a rule leaky, and all the problems that that causes, all the complicated conceptual terrain that it implies, should be central content of the study of abstraction. An AI safety result that only helps if your diagrams commute, IMO, has only a 20% chance of being useful and is probably missing 90% of the work to get there.

I absolutely agree abstractions are almost always leaky; I expect ~every abstraction we actually end up using in practice when e.g. analyzing neural networks to not make things commute perfectly. I think I disagree on two points though:

While I expect the leaky setting to add new difficulties compared to the exact one, I’m not sure how big they’ll be. I can imagine scenarios where those problems are the key issue (e.g. errors tend to accumulate and blow up in hard to deal with ways). But there are also lots of potential issues that already occur in an exact setting (e.g. efficiently finding consistent abstractions, how to apply this framework to problems).

Even if most of the difficulty is in the leaky setting, I think it’s reasonable to start by studying the exact case, *as long as insights are going to transfer*. I.e. if the 10% of problems that occur already in the exact case also occur in a similar way in the leaky one, working on those for a bit isn’t a waste of time.

That being said, I do agree investigating the leaky case early on is important (if only to check whether that requires a complete overhaul of the framework, and what kinds of issues it introduces). I’ve already started working on that a bit, and for now I’m optimistic things will mostly transfer, but I suppose I’ll find out soon-ish.

# Research agenda: Formalizing abstractions of computations

The terminology “RLHF” is starting to become confusing, as some people use it narrowly to mean “PPO against a reward model” and others use it more broadly to mean “using any RL technique with a reward signal given by human reviewers,” which would include FeedME.

Sorry for getting off track, but I thought FeedME did *not* use RL on the final model, only supervised training? Or do you just mean that the FeedME-trained models may have been fed inputs from models that had been RL-finetuned (namely the one from the InstructGPT paper)? Not sure if OpenAI said anywhere whether the latter was the case, or whether FeedME just uses inputs from non-RL models.

Nice project, there are several ideas in here I think are great research directions. Some quick thoughts on what I’m excited about:

I like the general ideas of looking for more comprehensive consistency checks (as in the “Better representation of probabilities” section), connecting this to mechanistic interpretability, and looking for things other than truth we could try to discover this way. (Haven’t thought much about your specific proposals for these directions)

Quite a few of your proposals are of the type “try X and see if/how that changes performance”. I’d be a bit wary of these because I think they don’t really help resolve uncertainty about the most important open questions. If one of these increases performance by 5%, that doesn’t tell you much about how promising the whole DLK approach is in the long term, or what the most likely failure modes are. If something doesn’t increase performance, that also doesn’t tell you too much about these.

Two exceptions to the previous point: (1) these types of experiments are pretty straightforward compared to more ambitious extensions, so I think they’re good if your main goal is to get more hands-on ML research experience. (2) Maybe you have some uncertainty about why/how the results in the paper are what they are, and making some small changes can give you evidence about that. This seems like an excellent thing to start with, assuming you have concrete things about the results you’re confused about. (Randomly trying things might also reveal surprising phenomena, but I’m much less sure that’s worth the time.)

So what could you aim for instead, if not improving performance a bit? I think one great general direction would be to look for cases where the current method just fails completely. Then work on solving the simplest such case you can find. Maybe the inverse scaling dataset is a good place to start, though I’d also encourage you to brainstorm other mechanistic ways why the current method might go wrong, and then come up with cases where those might happen. (Example of what I mean: maybe “truth” isn’t encoded in a linearly detectable way, and once you make the probe more complex, your constraints aren’t enough anymore to nail down the truth concept in practice).

ETA: I think the “adding labeled data” idea is a good illustration of what I’m talking about. Imagine you have problems where the method currently doesn’t work at all. If even large amounts of supervised data don’t help much on these, this suggests your probe can’t find a truth encoding (maybe because you’d need a higher capacity probe or if you already have that, maybe because the optimization is difficult). On the other hand, if you get good performance with supervised data, it suggests that you need stronger consistency checks. You can then also try things like adding supervised data in only one domain and check generalization, and you can expect a reasonably clear signal. But if you do all this on a dataset where the unsupervised method already works pretty well, then the only evidence you get is something like “does it improve performance by 2%, 5%, 10%, …?”, the signal is less clear, and it’s much harder to say which of these explanations a 5% improvement indicates. All that is in addition to the fact that finding cases which are difficult for the current method is really important in its own right.

A lot of historical work on alignment seems like it addresses subsets of the problems solved by RLHF, but doesn’t actually address the important ways in which RLHF fails. In particular, a lot of that work is only necessary if RLHF is prohibitively sample-inefficient.

Do you have examples of such historical work that you’re happy to name? I’m really unsure what you’re referring to (probably just because I haven’t been involved in alignment for long enough).

The base model for text-davinci-002 and -003 is code-davinci-002, not davinci. So that would seem to be the better comparison unless I’m missing something.