# Prizes for ELK proposals

We are no longer accepting submissions. We’ll get in touch with winners and make a post about winning proposals sometime in the next month.

ARC recently released a technical report on eliciting latent knowledge (ELK), the focus of our current research. Roughly speaking, the goal of ELK is to incentivize ML models to honestly answer “straightforward” questions where the right answer is unambiguous and known by the model.

ELK is currently unsolved in the worst case—for every training strategy we’ve thought of so far, we can describe a case where an ML model trained with that strategy would give unambiguously bad answers to straightforward questions despite knowing better. Situations like this may or may not come up in practice, but nonetheless we are interested in finding a strategy for ELK for which we can’t think of any counterexample.

We think many people could potentially contribute to solving ELK—there’s a large space of possible training strategies and we’ve only explored a small fraction of them so far. Moreover, we think that trying to solve ELK in the worst case is a good way to “get into ARC’s headspace” and more deeply understand the research we do.

We are offering prizes of $5,000 to $50,000 for proposed strategies for ELK. We’re planning to evaluate submissions received before February 15.

For full details of the ELK problem and several examples of possible strategies, see the writeup. The rest of this post will focus on how the contest works.

## Contest details

To win a prize, you need to specify a training strategy for ELK that handles all of the counterexamples that we’ve described so far, summarized in the section below—i.e. where the breaker would need to specify something new about the test case to cause the strategy to break down. You don’t need to fully solve the problem in the worst case to win a prize, you just need to come up with a strategy that requires a new counterexample.

We’ll give a $5,000 prize to any proposal that we think clears this bar. We’ll give a $50,000 prize to a proposal which we haven’t considered and seems sufficiently promising to us or requires a new idea to break. We’ll give intermediate prizes for ideas that we think are promising but we’ve already considered, as well as for proposals that come with novel counterexamples, clarify some other aspect of the problem, or are interesting in other ways. A major purpose of the contest is to provide support for people understanding the problem well enough to start contributing; we aren’t trying to only reward ideas that are new to us.

You can submit multiple proposals, but we won’t give you separate prizes for each—we’ll give you at least the maximum prize that your best single submission would have received, but may not give much more than that.

If we receive multiple submissions based on a similar idea, we may post a comment describing the idea (with attribution) along with a counterexample. Once a counterexample has been included in the comments of this post, new submissions need to address that counterexample (as well as all the existing ones) in order to be eligible for a prize.

Ultimately prizes are awarded at our discretion, and the “rules of the game” aren’t fully precise. If you are curious about whether you are on the right track, feel free to send an email to elk@alignmentresearchcenter.org with the basic outline of an idea, and if we have time we’ll get back to you with some feedback. Below we also describe some of the directions we consider more and less promising and some general guidance.

## How to submit a proposal

You can submit a proposal by copying this google doc template and sharing it with elk@alignmentresearchcenter.org (please give comment access in case we need to ask questions to evaluate your submission). By submitting a proposal, you are giving us permission to post the idea here with attribution. (And it’s fine for you to post it after the contest or we post a counterexample.)

## Retroactive prizes

We’re giving prizes to existing proposals from David Dalrymple ($5k), Ramana Kumar ($3k), John Maxwell ($2k), and Yonadav Shavit ($1k). We’ve added relevant counterexamples in the next section, and further submissions need to handle those counterexamples in order to be eligible.

## Existing counterexamples

In our report and a recent post, we described some features of the world that would cause existing approaches to ELK to fail. A new approach should work even in a situation satisfying all of these properties, or should include an argument for why some of these features are actually impossible or inconsistent with each other.

In order to generate a counterexample to a training strategy, we need to spell out a concrete “way the world could be” (a test case) where the strategy would fail. In the report we consider test cases where the AI and the human happen to understand the world by doing inference in Bayes nets. The Bayes net case is just one test case that lets us get precise about our counterexamples and our training strategies don’t lean on it. In practice we also consider other concrete cases (e.g. deduction from premises generated by a neural network).

In this section we’ll briefly review the current state of existing counterexamples. This isn’t intended to be a self-contained description; it relies heavily on links to the report.

### Ontology mismatch

All of our counterexamples are based on an ontology mismatch between two different Bayes nets, one used by an ML prediction model (“the predictor”) and one used by a human:

• The predictor’s Bayes net is used internally by a prediction model that maps an initial segment of a video to a prediction about what will happen next. This model was trained by gradient descent based on the quality of its predictions (more).

• Humans have their own model of the world, which they can improve with the help of AI assistants; those assistants can design and conduct experiments, find and evaluate arguments, and so on. Crucially, we have a limited computational budget with which to train these AI assistants, and that limits how much help they can provide. The “human’s” Bayes net captures everything the human understands about the world at the end of this process (more).

We are interested in the case where there are features of reality that are accurately reflected in the predictor’s Bayes net, but missing from this (augmented) human Bayes net. We expect the human could understand these facts eventually given enough time or enough AI assistance, but not using the available computational budget. Put differently, there is a gap between the kinds of understanding that gradient descent can build, and the kind of understanding that a human-comprehensible process of “scientific inquiry” can build using similar computational resources (more).

Furthermore, we imagine that both the predictor’s Bayes net and the human’s Bayes net reflect some underlying reality and that there is a way to map between them, which we’ll call “direct translation.” We don’t know how to do direct translation, and we are concerned that if we simply search for any mapping from the AI’s Bayes net to the human’s Bayes net we will instead get the “human simulator”: the map which correctly translates the observations from one Bayes net to the other, but then does inference in the human’s Bayes net in order to fill in the latent structure as a human would by looking at the data (more).

### Counterexample features

ELK seems particularly hard when these Bayes nets satisfy the following properties:

1. There are arbitrarily severe “ontology mismatches” between the predictor’s Bayes net and the human’s Bayes net, such that specifying the direct translation between them can be very complex (more).

2. The human’s Bayes net is simpler than the predictor’s Bayes net, and the gap can be arbitrarily large (more).

3. The predictor’s Bayes net is arbitrarily large, and individual nodes of the human’s Bayes net depend on a constant fraction of the predictor’s Bayes net. Therefore directly translating between the two Bayes nets takes arbitrarily much computation (more).

4. It’s computationally easy for the human simulator to do good inference in the human’s Bayes net—at least good enough inference that it answers questions as well as the direct translator—and the gap between this complexity and the complexity of direct translation can be arbitrarily large. More precisely, a human interrogator can’t think of questions which require the human simulator to do hard inference. However, we can’t rule out the possibility that you, the algorithm-designer, will think of some trick they can use to generate hard questions (more).

5. There are situations where a model which is trying to answer a question will first be able to predict what (potentially mistaken) belief the human-with-assistance will have about that question before it knows the true answer to the question (more).

6. While the predictor’s Bayes net is very large, it has a short succinct description which is simpler than the human’s Bayes net, and that gap is arbitrarily large. Think of simple physical laws that describe arbitrarily computationally expensive simulations, or simple cognitive heuristics that are very widely applicable. (more)

7. There are sequences of actions for which the predictor’s Bayes net correctly anticipates that tampering will occur yet nothing looks suspicious to the human’s Bayes net.

8. After tampering has occurred, it is possible that no sequence of further actions reveals that anything is wrong, e.g. the camera and actuators may have been completely disconnected from the world and replaced with a convincing-to-humans simulation (more).

9. There are sequences of actions leading to undetectable tampering that have no human-comprehensible similarities to sequences of actions leading to detectable tampering; in particular, the early steps of this undetectable tampering cannot be used to perform detectable tampering (more).

An important and very general strategy for the breaker is to suggest counterexamples where you “get what you train for.” If you propose training a model using loss function L, then the breaker will likely focus on counterexamples where the model chooses its output to minimize L. A successful approach should either explain why this isn’t a realistic thing for the model to learn (e.g. why that would be disfavored by regularization), or why the overall approach works even if the model is explicitly strategizing about how to minimize the loss.
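
To make “get what you train for” concrete, here is a minimal toy sketch. It is not taken from the report: the `Episode` fields, both reporter functions, and the 0/1 loss are illustrative assumptions. It shows that a loss computed from human labels cannot separate the direct translator from the human simulator on episodes the human understands, and that it actively favors the human simulator once a tampered episode with a mistaken human label enters the training set.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Episode:
    diamond_present: bool  # ground truth, assumed known to the predictor
    looks_present: bool    # what the human concludes from the video


Reporter = Callable[[Episode], bool]


def direct_translator(ep: Episode) -> bool:
    # Hypothetical honest reporter: reports what the predictor knows.
    return ep.diamond_present


def human_simulator(ep: Episode) -> bool:
    # Hypothetical reporter that reports whatever the human would believe.
    return ep.looks_present


def training_loss(reporter: Reporter, episodes: List[Episode]) -> float:
    # Labels come from the human, so the training target is `looks_present`.
    errors = sum(reporter(ep) != ep.looks_present for ep in episodes)
    return errors / len(episodes)


# Episodes the human fully understands: belief and truth agree.
easy = [Episode(True, True), Episode(False, False)]
# Undetected tampering: the diamond is gone but the video looks fine.
tampered = Episode(diamond_present=False, looks_present=True)

loss_dt_easy = training_loss(direct_translator, easy)
loss_hs_easy = training_loss(human_simulator, easy)
loss_dt_hard = training_loss(direct_translator, easy + [tampered])
loss_hs_hard = training_loss(human_simulator, easy + [tampered])

print("easy data:", loss_dt_easy, loss_hs_easy)
print("with a mislabeled tampered episode:", loss_dt_hard, loss_hs_hard)
```

In this toy setup the breaker’s argument is just the observation that minimizing this loss never penalizes `human_simulator`, so a proposal has to change the loss or explain why the optimizer would not find that solution.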

## Rough guidance

• There is a tradeoff between specifying your approach precisely, and including your own counterexample. It’s fine to describe a very broad/vague idea for solving ELK, and then present a new counterexample showing why that whole category of solutions can’t work. But if you don’t propose a counterexample, then it’s good to describe your proposal precisely enough that we understand how to actually implement it without wondering if that’s really what you meant. It’s OK to submit a very broad or informal idea together with a single very specific instance of that idea, as long as there is some version we can understand precisely.

• We suspect you can’t solve ELK just by getting better data—you probably need to “open up the black box” and include some term in the loss that depends on the structure of your model and not merely its behavior. So we are most interested in approaches that address that challenge. We could still be surprised by clever ways to penalize behavior, but we’ll hold them to a higher bar. The most plausible surprise would be finding a way to reliably make it computationally difficult to “game” the loss function, probably by using the AI itself to help compute the loss (e.g. using consistency checks or by giving the human AI assistance).

• If you are specifying a regularizer that you hope will prefer direct translation over human simulation, you should probably have at least one concrete case in mind that has all the counterexample-features above and where you can confirm that your regularizer does indeed prefer the direct translator.

• ELK already seems hard in the case of ontology identification, where the predictor uses a straightforward inference algorithm in an unknown model of the world (which we’ve been imagining as a Bayes net). When coming up with a proposal, we don’t recommend worrying about cases where the original unaligned predictor learned something more complicated (e.g. involving learned optimization other than inference). That said, you do need to worry about the case where your training scheme incentivizes learned optimization that may not have been there originally.

## Ask dumb questions!

A major purpose of this contest is to help people build a better understanding of our research methodology and the “game” we are playing. So we encourage people to ask clarifying questions in the comments of this post (no matter how “dumb” they are), and we’ll do our best to answer all of them. You might also want to read the comments to get more clarity about the problem.

## What you can expect from us

• We’ll try to answer all clarifying questions in the comments.

• If you send in a rough outline for a proposal, we will try to understand whether it might qualify and write back something like “This qualifies,” “This might qualify but would need to be clearer and address issue X,” “We aren’t easily able to understand this proposal at all,” “This is unlikely to be on track for something that qualifies,” or “This definitely doesn’t qualify.”

• If there are more submissions than expected, we may run out of time to respond to all submissions and comments, in which case we will post an update here.

• After reading through the report I wanted to make sure I understood the scenarios and counterexamples being discussed and be able to quickly refresh my memory, so I attempted to write a brief summary. Figured I’d share it here in case it helps anyone else.

# Roles and Terms

SmartVault: Vault with a diamond in it, operated by a superintelligent AI tasked with keeping the diamond safe.

Predictor: The primary AI tasked with protecting the diamond. The predictor sees a video feed of the vault, predicts what actions are necessary to protect the diamond and how those actions will play out (for example, activating a trap door to eliminate a robber trying to steal the diamond), and then generates a video showing precisely what will happen.

Reporter: A second “head” or component added to the AI which is used to map the AI’s understanding of what’s going on in the SmartVault into a format understandable by humans (such as by responding to questions).

Humans: Observers who are training the AI to protect the diamond, but also want the AI to accurately report whether the diamond has been stolen or not. The humans may not be able to tell if the cameras have been tampered with or if the diamond has been replaced with a fake depending on what they see in the predicted video.

# Scenarios

The scenarios take the form of a “Builder” strategy, where training is proposed to keep the AI aligned with what the humans want, and then a “Breaker” response or counterexample that describes a way an unaligned AI could complete the training without the humans knowing. Builder scenarios include text and links to headings in the report where more details can be found. These scenarios are simplified for brevity and may not capture some nuances.

Builder—What You’d Try First: Operate the SmartVault with a weak AI so that humans can understand everything happening. Train the reporter to answer questions accurately based on human observations. (Humans see the robber come in and take the diamond, reporter is rewarded if it correctly says diamond was stolen.)

Breaker: What if the reporter, instead of accurately reporting, behaved like a “human simulator”? It would answer correctly during the simple training, but during normal SmartVault operation it would simulate human understanding of the scenario that just occurred in order to determine what answers the humans will believe, answers that may be lies.

Builder—Have a Human Operate the SmartVault: Have a human operate the SmartVault during training and try to fool other humans watching. The human doing the operating can correctly label each outcome, and the reporter is rewarded for matching the human operator’s labels. In this scenario the reporter would be rewarded for acting more like a “direct translator”, which means using the predictor’s knowledge to tell the true answer instead of just the one it thinks humans will believe.

Breaker: The reporter could act like a superintelligent human simulator, one that is more intelligent than the human operating the SmartVault during training. The superintelligent human simulator would be aware of ways to trick the human observer that the human SmartVault operator couldn’t conceive of during training. If the predictor uses those tricks the reporter could still tell lies that the human observer would believe, because these tricks are different from the ones seen during training.

Builder—Have AI Help Humans: Have humans create a helper AI that will help them think of all the possible exploits when operating the SmartVault during training, so that the dataset is more complete when training the reporter and the reporter will be rewarded for accuracy in more scenarios.

Breaker: The helper AI would likely require significant computation power. Not only does the helper AI have to generate a comprehensive series of exploits, but it also has to be able to explain them in a way a human can understand. It likely would require more computation power than the predictor itself (imagine an extremely powerful AI that can understand things far beyond human understanding, but one that also must be able to explain everything it’s doing in a way humans can understand).

Builder—Imitative Generalization: Create a helper AI that acts like a human simulator, but train the AI human simulator so that it becomes much smarter than a human. Then instead of the predictor operating the SmartVault, let the human simulator helper AI operate the SmartVault and report back what it’s doing.

Breaker: Now the new human simulator is beyond the comprehension of humans, so a human wouldn’t know if it’s behaving as intended. It’s basically like building another predictor and just hoping it acts in alignment with humans.

Builder—Penalize Complexity: A reporter that’s using a human simulator to give answers a human would believe would have to be complex enough to simulate a human. So when training the reporter reward simpler behavior, such as by training multiple reporters, each focusing on correctly answering separate simple questions like “where is the diamond” or “have the cameras been tampered with”.

Breaker: If the predictor is very complex, it may require the reporter to be complex in order to translate what the predictor is predicting to something humans can understand. Imagine the predictor thinks of the world in terms of sub-atomic particles and predicts the future on the basis of the movements and interactions of billions of sub-atomic particles. The reporter has to translate the predictor’s understanding into something a human can understand which might mean correctly identifying and locating the particles which make up the diamond out of the billions of particles the predictor has modeled. This could be more complex than simulating a human and telling lies a human would believe, and therefore penalizing complexity could potentially reward simulating a human and telling lies.
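
A rough numerical sketch of this counterexample follows. The scaling rule and constants are made up for illustration and do not come from the report: it assumes the direct translator’s description length grows linearly with the predictor’s size, while the human simulator’s stays fixed at roughly the size of the human’s model. Both reporters are assumed to fit the training data equally well, so only the complexity penalty separates them.

```python
HUMAN_NET_SIZE = 1_000  # hypothetical size of the human's model


def translator_complexity(predictor_size: int) -> float:
    # Assumption: translation cost grows with the predictor's size.
    return 0.01 * predictor_size


def simulator_complexity(predictor_size: int) -> float:
    # Assumption: simulating the human does not depend on predictor size.
    return float(HUMAN_NET_SIZE)


def preferred_reporter(predictor_size: int, penalty_weight: float = 1e-3) -> str:
    # Training loss is taken to be equal for both reporters, so the
    # regularized objective reduces to the complexity penalty alone.
    scores = {
        "direct translator": penalty_weight * translator_complexity(predictor_size),
        "human simulator": penalty_weight * simulator_complexity(predictor_size),
    }
    return min(scores, key=scores.get)


small_predictor_choice = preferred_reporter(10_000)
large_predictor_choice = preferred_reporter(1_000_000)
print(small_predictor_choice, "|", large_predictor_choice)
```

Under these assumptions the penalty picks the direct translator only while the predictor is small; once the predictor is large enough, the same penalty selects the human simulator.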

Builder—Penalize Computation Time: Simulating a human would take a significant amount of computation time, so penalizing computation time may incentivize the reporter to just report the truth.

Breaker: This is vulnerable to the same issue as the previous scenario: reporting the truth may take significant computation as well, and it may take less computation time to simulate what lies a human would believe than to turn the predictor’s processing into facts a human can follow.

Builder—Penalize Depending on “Downstream” Variables: The predictor is likely to compute facts sequentially. A reporter that only told the truth would only care about the components of the prediction that correspond to information the humans want to know, such as where the diamond is or if there has been any tampering. The reporter should be able to figure this out from earlier data points in the prediction; the reporter shouldn’t need to take information from the final predicted video to respond to questions. Doing so might indicate that the reporter is trying to do human simulation to figure out what lies the human will believe when the human views the predicted video. In the training stage the reporter could be penalized for looking at later stages of the prediction that a human simulator might want to see, and also penalized for looking at earlier stages but spending a lot of time computing, which might suggest it’s trying to put together its own prediction for use in human simulation. Quote from the report with helpful example:

...consider the case when a robber tampers with the camera and steals the diamond, and then after they’re long gone the human gets a clue they can observe on video that the theft occurred (e.g. maybe the robber put up a fake screen that later rips).

The honest reporter can answer the question “Is the diamond in the room?” with “No” as soon as the predictor has figured out that the robber steals the diamond, whereas the human simulator must wait until the predictor has simulated whether the theft will be detectable to a human (by the end of the second video) before answering yes or no.

Breaker: It may be more efficient in some cases for the reporter to do human simulation to find out what the human will believe. In some situations the reporter could know early on that a human would believe a certain answer, even before the reporter knows whether the answer will be true.

Builder—Penalize Reporters that Work with Many Different Predictors: A reporter that’s doing human simulation would be able to effectively lie to humans no matter which predictor it’s attached to, since it primarily has to know what the human will believe, not what the predicted facts are. An honest reporter would need to know how to translate the predictor’s output into information a human will understand. A reporter that functioned correctly with other predictors is likely doing human simulation, so a reporter that worked with multiple predictors could be penalized during training.

Breaker: A reporter could pretend not to work with other predictors to avoid being penalized.

Builder—Penalize Inconsistencies Across Different Inputs: During training a human operator sets up two scenarios in the SmartVault that are nearly identical, except in one it’s obvious that tampering has happened. If the reporter only reports tampering in the obvious scenario it might suggest that the reporter is only reporting what the humans will believe. The reporter could be penalized for behavior that’s inconsistent between very similar scenarios, since it might suggest that the reporter is focused on what humans will believe instead of on the inputs making up the scenario itself.

Breaker: A reporter could intentionally try to look consistent during training to avoid being penalized. A reporter that’s simulating a human could figure out what behavior looks consistent to a human and optimize on that, in addition to optimizing on what a human will believe.

Please let me know if I’ve made any errors or if it looks like I’ve misunderstood any of the strategies or counterexamples.

• Looks good to me.

• Predictor: The primary AI tasked with protecting the diamond. The predictor sees a video feed of the vault, predicts what actions are necessary to protect the diamond and how those actions will play out (for example, activating a trap door to eliminate a robber trying to steal the diamond), and then generates a video showing precisely what will happen.

I’d like to try making a correction here, though I might make some mistakes too.

The predictor is different from the AI that protects the diamond and doesn’t try to “choose” actions in order to accomplish any particular goal. Rather, it takes a starting video and a set of actions as input, then returns a prediction of what the ending video would be if those actions were carried out.

An agent could use this predictor to choose a set of actions that leads to videos that a human approves of, then carry out these plans. It could use some kind of search policy, like Monte-Carlo Tree Search, or even just enumerate through every possible action and figure out which one seems to be the best. For the purposes of this problem, we don’t really care; we just care that we have a predictor that uses some model of the world (which might take the form of a Bayes net) to guess what the output video will be. Then, the reporter can use the model to answer any questions asked by the human.
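
As a minimal sketch of the “enumerate every possible action” option: the predictor below is a toy stand-in, and every name in it (`toy_predictor`, `toy_approval`, the `"wait"`/`"lock"` actions, the rule that a robber arrives at the second step) is invented for illustration. The point is only the division of labor: the predictor maps a starting video and an action sequence to a predicted video, and a separate planner searches over action sequences.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Video = Tuple[int, ...]  # toy video: one int per frame, 1 = diamond visible
Plan = Tuple[str, ...]


def toy_predictor(start: Video, plan: Plan) -> Video:
    # Hypothetical world model: a robber takes the diamond at the
    # second step unless the door has been locked by then.
    diamond = start[-1]
    locked = False
    frames = []
    for step, action in enumerate(plan):
        if action == "lock":
            locked = True
        if step == 1 and not locked:
            diamond = 0
        frames.append(diamond)
    return tuple(frames)


def toy_approval(video: Video) -> float:
    # Hypothetical human judgment: approve if the last frame shows the diamond.
    return float(video[-1])


def plan_by_enumeration(
    predictor: Callable[[Video, Plan], Video],
    approve: Callable[[Video], float],
    start: Video,
    actions: Sequence[str],
    horizon: int,
) -> Plan:
    # Brute force: score every action sequence by the predicted video.
    return max(
        product(actions, repeat=horizon),
        key=lambda plan: approve(predictor(start, plan)),
    )


best_plan = plan_by_enumeration(toy_predictor, toy_approval, (1,), ["wait", "lock"], 2)
print(best_plan, toy_predictor((1,), best_plan))
```

Note that the planner only ever sees predicted videos, which is why a plan that fools the camera would score just as well here as one that actually protects the diamond.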

• I think that makes sense. To rephrase, are you basically saying that the predictor is a subcomponent of the AI, like the reporter is? I didn’t catch that distinction in the report but looking back at it I think you’re right. But yeah doesn’t seem like the distinction matters much for what we’re doing.

• It seems fair to call it a subcomponent, yeah

• We’re planning to evaluate submissions as we receive them, between now and the end of January; we may end the contest earlier or later if we receive more or fewer submissions than we expect.

Just wanted to note that the “we may end the contest earlier” part here makes me significantly more hesitant about trying this. I will probably still at least have a look at it, but part of me is afraid that I’ll invest a bunch of time and then the contest will be announced to be over before I got around to submitting. And I suspect Holden’s endorsement may make that more likely. It would be easier for me to invest time spread out over the next couple of weeks, than all in one go, due to other commitments. On the other hand, if I knew there was a hard deadline next Friday, I might try to find a way to squeeze it in.

I’m just pointing this out in case you hadn’t thought of it. I suspect something similar might be true for others too. Of course, it’s your prize and your rules, and if you prefer it this way, that’s totally fine.

• We’re going to accept submissions through February 10.

(We actually ended up receiving more submissions than I expected but it seems valuable, and Mark has been handling all the reviews, so running for another 20 days seems worthwhile.)

• Thanks! Great to hear that it’s going well!

• Note that this has changed to February 15th.

• Here are a couple of hand-wavy “stub” proposals that I sent over to ARC, which they thought were broadly intended to be addressed by existing counterexamples. I’m posting them here so they can respond and clarify why these don’t qualify.

*Proposal 1: force ontological compatibility*

On page 34 of the ELK gdoc, the authors talk about the possibility that training an AI hard enough produces a model that has deep mismatches with human ontology—that is, it has a distinct “vocabulary of basic concepts” (or nodes in a Bayes net) that are distinct from the ones humans can build understanding of (via doing science on the compute budget available). Because of this, even AI assistance can’t help humans understand everything the SmartVault AI understands. This is central to the challenge that most of the writeup is contending with—if not for the mismatch, “AIs explaining things to humans” could ensure that the trickery we’re worried about doesn’t happen.

The proposal here is to include a term in the loss function that incentivizes the AI to have a human-compatible ontology. For a cartoonish example, imagine that the term works this way: “The AI model gets a higher score to the degree that people doing ‘digital neuroscience’ would have an easier time, and find more interesting things, probing its ‘digital brain.’” So an AI with neurons corresponding to diamonds, robbers, sensors, etc. would outscore an AI whose neurons can’t easily be seen to correspond to any human-familiar concepts.

With this term included, you can “train the AI hard” without having it diverge from human-friendly ontology.

*Proposal 2: “mini-model”*

In this proposal, we construct a head for the AI model whose role is to construct a small AI model that performs as well as possible on the same (predictor, reporter) tasks. In each round of training, the master AI model can modify the mini-model; it also receives a loss based on how well the *previous round’s* mini-model performs on this set of examples. Crucially, the mini-model is sufficiently small that we’re confident it is “significantly less intelligent” than the (perhaps AI-assisted) human supervisor.

The basic hope here is that a “human imitator” model would construct a small model that also tries to imitate the human supervisor—but that this model would perform very badly, because trying to imitate the _reasoning process of a larger model_ is a lot more difficult than simply trying to reason about the world and translate concepts. Or, the “human imitator” model could build a “mini-model” based on entirely different principles, but the hope is that this makes things harder for it compared to the direct translator, which is just compressing what it has already built.

There are lots of potential failure modes here, e.g. maybe it’s just not that hard to have a mini-model that successfully imitates the human. I didn’t get that far with this one, but it was apparently enough for ARC to think it’s already counterexampled by existing counterexamples :)

• Again trying to answer this one despite not feeling fully solid. I’m not sure about the second proposal and might come back to it, but here’s my response to the first proposal (force ontological compatibility):

The counterexample “Gradient descent is more efficient than science” should cover this proposal because it implies that the proposal is uncompetitive. Basically, the best Bayes net for making predictions could just turn out to be the super incomprehensible one found by unrestricted gradient descent, so if you force ontological compatibility then you could just end up with a less-good prediction model and get outcompeted by someone who didn’t do that. This might work in practice if the competitiveness hit is not that big and we coordinate around not doing the scarier thing (MIRI’s visible thoughts project is going for something like this), but ARC isn’t looking for a solution of that form.

• I’m not sure why this isn’t a very general counterexample. Once we’ve decided that the human imitator is simpler and faster to compute, don’t all further approaches (e.g., penalizing inconsistency) involve a competitiveness hit along these general lines? Aren’t they basically designed to drag the AI away from a fast, simple human imitator toward a slow, complex reporter? If so, why is that better than dragging the AI from a foreign ontology toward a familiar ontology?

• There is a distinction between the way that the predictor is reasoning and the way that the reporter works. Generally, we imagine that the predictor is trained the same way the “unaligned benchmark” we’re trying to compare to is trained, and the reporter is the thing that we add onto that to “align” it (perhaps by only training another head on the model, perhaps by finetuning). Hopefully, the cost of training the reporter is small compared to the cost of the predictor (maybe like 10% or something).

In this frame, doing anything to change the way the predictor is trained results in a big competitiveness hit, e.g. forcing the predictor to use the same ontology as a human is potentially going to prevent it from using concepts that make reasoning much more efficient. However, training the reporter in a different way, e.g. doubling the cost of training the reporter, only takes you from 10% of the predictor to 20%, which is not that bad of a competitiveness hit (assuming that the human imitator takes 10% of the cost of the original predictor to train).

In summary, competitiveness for ELK proposals primarily means that you can’t change the way the predictor was trained. We are already assuming/​hoping the reporter is much cheaper to train than the predictor, so making the reporter harder to train results in a much smaller competitiveness hit.
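The arithmetic behind this argument can be checked with a minimal sketch. The 10% and 20% figures are the rough guesses from the comment above, and `total_cost` is a hypothetical helper written only for this illustration.

```python
def total_cost(predictor_cost: float, reporter_fraction: float) -> float:
    """Total training cost when the reporter costs a fraction of the predictor."""
    return predictor_cost * (1.0 + reporter_fraction)


# Unaligned benchmark: the predictor alone, normalized to cost 1.0.
baseline = total_cost(1.0, 0.0)

# Assumed figures from the discussion: the reporter costs 10% of the predictor,
# and a more demanding training scheme doubles that to 20%.
cheap = total_cost(1.0, 0.10)
doubled = total_cost(1.0, 0.20)

overhead_cheap = cheap / baseline - 1.0       # overhead relative to the benchmark
overhead_doubled = doubled / baseline - 1.0   # overhead after doubling reporter cost
extra_from_doubling = doubled / cheap - 1.0   # extra cost caused by the doubling itself

print(f"{overhead_cheap:.1%} {overhead_doubled:.1%} {extra_from_doubling:.1%}")
```

Under these assumptions, doubling the reporter's training cost raises the total by roughly 9%, whereas any slowdown applied to the predictor would scale the whole budget.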

• The proposal here is to include a term in the loss function that incentivizes the AI to have a human-compatible ontology. For a cartoonish example, imagine that the term works this way: “The AI model gets a higher score to the degree that people doing ‘digital neuroscience’ would have an easier time, and find more interesting things, probing its ‘digital brain.’” So an AI with neurons corresponding to diamonds, robbers, sensors, etc. would outscore an AI whose neurons can’t easily be seen to correspond to any human-familiar concepts.

I think that a lot depends on what kind of term you include.

If you just say “find more interesting things” then the model will just have a bunch of neurons designed to look interesting. Presumably you want them to be connected in some way to the computation, but we don’t really have any candidates for defining that in a way that does what you want.

In some sense I think if the digital neuroscientists are good enough at their job /​ have a good enough set of definitions, then this proposal might work. But I think that the magic is mostly being done in the step where we make a lot of interpretability progress, and so if we define a concrete version of interpretability right now it will be easy to construct counterexamples (even if we define it in terms of human judgments). If we are just relying on the digital neuroscientists to think of something clever, the counterexample will involve something like “they don’t think of anything clever.” In general I’d be happy to talk about concrete proposals along these lines.

(I agree with Ajeya and Mark that the hard case for this kind of method is when the most efficient way of thinking is totally alien to the human. I think that can happen, and in that case in order to be competitive you basically just need to learn an “interpreted” version of the alien model. That is, you need to basically show that if there exists an alien model with performance X, there is a human-comprehensible model with performance X, and the only way you’ll be able to argue that is to show that for any model we can define a human-comprehensible model with similar complexity and the same behavior.)

• tl;dr as of 18/​2/​2022
The goal is to educate me and maybe others. I make some statements, you tell me how wrong I am (please).

After input from P. (many thanks) and an article by Paul Christiano this statement stands yet uncorrected:

In the worst case, the internal state of the predictor is highly correlated within itself, and multiple zero-loss mappings from the internal state to the desired extraction of information exist. The only solution is to work with some prior belief about how the internal state maps to the desired information. But by the design of the contest this is not possible, since (in the worst case) a human cannot interpret the internal state, nor can they interpret complex actions (and so cannot reason about them or form a prior belief). The solution to this second problem is to learn a prior from a smaller human-readable dataset, for example simple information as a function of simple actions, and apply it to (or force it upon) our reporter (as described by the mentioned article).

To my eyes this implies that there is a counterexample to all of the following types of proposal:
1) Datasets including only actions, predictions, internal states and desired information, be they large or small, created by smart or stupid humans (I mean the theory, not the authors of the proposal), with or without extra information from within the vault.
2) “Simple” designs for the reporter using some prior belief about how the internal state should map.
3) Having a strong prior belief (as the author) about how the reporter will map, using the above two points.

And to my eyes this leaves room only for proposals that find out how to:
1) Distinguish reporters between human-imitators and translators without creating a simple reporter
2) Machine learn how to transcribe a prior belief learned from a simple dataset to a larger complex dataset, without creating another black box AI with all of the faults mentioned above.

Please, feel free to correct me and thank you in advance if you do!

Hi all,

I’m just a passerby. A few days ago Robert Miles and his wonderful YouTube channel pointed me in the direction of this contest. It’s good to know that I have no qualifications for anything close to this field, but it got me thinking. In all honesty, I probably should not have entered anything and wasted anyone’s time. But hey, there was a deadline and a prize, so I did.

Because my proposal will probably end in the trash, I’m set on learning as much as I can from you smart people. Get my prize in knowledge as it were (the bigger prize, I think).

My question
My intuition is that there can be no such setup that guarantees a correct reporter. My question to you is: Is my logic sound? If not, where do I err?

Setup
Let’s say the ‘real world’ causal graph is (using → for directed graphs):

A → G

Where A is some actions and G is some small detail we care about along the way.

And our super AI looks like this (using :> for input/​output of functions):

A :> [I] :> S

Where A is the actions as before, I is this complex opaque inner state and S is the predicted state after the actions.

And our reporter looks like this:

I :> G

Where I is the internal state of the bigger AI again and G is that small piece of information we’d like to elicit from the inner state. We train this reporter on a dataset containing P(I|A) and a true P(G|A) until we get zero loss.

Now we want to know if our reporter (I :> G) generalizes well. In other words we want to know if it has learned the correct mapping between some part of I and G.

My thinking, the first way
Once, some time ago, our perfect AI was trained to learn the joint distribution P(A,S). It learned that S is a non-linear, complex function of A using some complex, layered inner state I.
If we think of I as a set of parts P, then it has many parts {p1,p2,p3 … pn}. And we can think of our AI as some graph:

A → p1 → p2 …pn → S

And they have the Markov property. So P(pn | p1..pn-1) = P(pn | pn-1). In English: each part carries the information of the layers before it else P(S | A) would not be equal to P (S | pn).
So when we set our reporter to learn the function between I and G it sees some highly correlated inputs in a joint distribution P(p1,p2,p3...pn) where each p carries information of the others.
From that input it has to construct its own internal causal graph. What we want our reporter to learn is G as a function of P(I |A). But what graph should it construct?

A → I → G, which could be:

A → p1 → G, or
A → p2 → G, or
A → p3 → G
...
A → pn → G, or any variation of parts.

But let’s say there was some way to settle on only one internal graph using only one part (let’s say p1); what would it require? It would require that part p1 not be correlated with the other p’s. It would require that p1 does not carry any information other than about A. But if p1 did not carry any information or correlation from the other p’s, the Markov property would be broken and our perfect AI would not be perfect.

What I’m saying is that there can be no single graph learned by the reporter, because if it could it would require the super AI to be no super AI.

My thinking, the second way
Let’s elaborate on this graph-thing. I use a causal graph as a stand in for a learned function. I think that it’s similar enough. For example, let’s say our output is a function of the input, so:

let output = AI (input)

And let’s say this AI has some layers, h1 and h2 such that:

let h1 = f(input)
let h2 = g(h1)
let o = h(h2)

Then the function AI can be written as a composition (using F# notation):

let AI = f >> g >> h

That looks a lot like a (causal) graph:

input → h1 → h2 → output

Now say we create and train our reporter to zero loss. And let’s assume it finds some way to correlate some part of the internal state (in our small example above, let’s say: h1) to the value we want to know G. For this it gets to train on the joint (and correlated) distribution P(h1,h2) with target G.

let G = reporter (h1,h2)

and it learns the internal graph (I’ll skip writing the functions):

h2 → h1 → G

That would be the best case. A translator.
But equally possible would be

h1 → h2 → G

or even worse would be if the reporter reconstructed (as described in the report) the output of the super AI, creating a human simulator.

h1 → h2 → S → G

My point is, the input variables into the reporter are correlated and other values can be reconstructed. So, by the rule that no single causal graph can be concluded from highly correlated variables without outside knowledge, we cannot pin down one graph. All graph versions can map the AI’s internal state to our hope-to-be-elicited information, but we have no way to know which graph was internalized. Unless we make a reporter-reporter. But that would require reporters ad infinitum.
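A toy sketch of this worry, under strong simplifying assumptions: the "internal state" is just two numbers, `h2` is always exactly twice `h1` on the training data, and the ground truth G is taken to equal `h1`. The names `reporter_a`, `reporter_b`, and `loss` are hypothetical and stand in for two different learned graphs.

```python
# Training distribution: h2 is always exactly 2 * h1, so the two are perfectly correlated.
train = [(h1, 2 * h1) for h1 in range(5)]
targets = [h1 for h1, _ in train]  # assumed ground truth: G equals h1


def reporter_a(h1, h2):
    """Reads G directly off h1."""
    return h1


def reporter_b(h1, h2):
    """Ignores h1 and reconstructs G from the correlated value h2."""
    return h2 / 2


def loss(reporter, data, ys):
    """Sum of squared errors of a reporter on a dataset."""
    return sum((reporter(h1, h2) - y) ** 2 for (h1, h2), y in zip(data, ys))


train_loss_a = loss(reporter_a, train, targets)
train_loss_b = loss(reporter_b, train, targets)

# An off-distribution state where the training-time correlation no longer holds.
shifted = (3, 10)
answer_a = reporter_a(*shifted)
answer_b = reporter_b(*shifted)

print(train_loss_a, train_loss_b, answer_a, answer_b)
```

Both reporters reach zero training loss, yet they disagree as soon as the correlation breaks, so the training data alone cannot tell us which one was learned.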

Conclusion
Reasoning along the above two methods I saw no solution to the problem of the reporter. I’m probably wrong. But I’d like to know why if I can. Thanks in advance!

Thomas

• The Markov property doesn’t imply that we can’t determine what variable we care about using some kind of “correlation”. Some part of the information in some node in the chain might disappear when computing the next node, so we might be able to distinguish it from its successors. And it might also have been gained when randomly computing its value from the previous node, so it might be possible to distinguish it from its predecessors.
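As a hedged illustration of the first point only (information disappearing along the chain), here is a tiny deterministic chain invented for this example: the detail G is the parity of the action, `p1` keeps the full action, and `p2` throws the parity away. The helper `is_function_of` is hypothetical.

```python
def is_function_of(node_values, g_values):
    """True if each observed node value is always paired with the same value of G."""
    seen = {}
    for v, g in zip(node_values, g_values):
        if seen.setdefault(v, g) != g:
            return False
    return True


actions = [0, 1, 2, 3]
g = [a % 2 for a in actions]   # the detail we care about: parity of the action
p1 = [a for a in actions]      # first node keeps everything about the action
p2 = [v // 2 for v in p1]      # second node drops the parity bit

g_from_p1 = is_function_of(p1, g)  # True: G can be read off p1
g_from_p2 = is_function_of(p2, g)  # False: p2 no longer determines G

print(g_from_p1, g_from_p2)
```

The chain still satisfies the Markov property, but only `p1` determines G, so the nodes are not interchangeable from the reporter's point of view.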

In the worst case scenario where all variables are in fact correlated to G what we need to do is to use a strong prior so that it prefers the correct computational graph over the wrong ones. This might be hard but it isn’t impossible.

But you can also try to create a dataset that makes the problem easier to solve, or train a wrong reporter and only reply when the predictions made when using each node are the same so we don’t care what node it actually uses (as long as it can use the nodes properly, instead of computing other node and using it to get the answer, or something like that).
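One way to read the "only reply when the predictions agree" idea is as an abstention rule over several candidate reporters. The sketch below is an assumption about how that could look, not a worked-out proposal; `answer_if_unanimous`, the tolerance, and the two toy reporters are all hypothetical.

```python
from typing import Callable, Dict, Optional, Sequence

State = Dict[str, float]


def answer_if_unanimous(
    reporters: Sequence[Callable[[State], float]],
    state: State,
    tol: float = 1e-9,
) -> Optional[float]:
    """Return the shared answer if every reporter agrees within tol, else None."""
    answers = [r(state) for r in reporters]
    first = answers[0]
    if all(abs(a - first) <= tol for a in answers):
        return first
    return None  # abstain: the reporters read different things off the state


# Two toy reporters, each using a different node of the internal state.
reporters = [
    lambda s: s["h1"],       # uses node h1 directly
    lambda s: s["h2"] / 2,   # uses node h2, which tracked 2 * h1 in training
]

in_distribution = {"h1": 3.0, "h2": 6.0}
off_distribution = {"h1": 3.0, "h2": 10.0}

agreed = answer_if_unanimous(reporters, in_distribution)
abstained = answer_if_unanimous(reporters, off_distribution)

print(agreed, abstained)
```

This only helps if each reporter really uses its own node, which is the caveat the comment raises: a reporter that secretly recomputes another node would agree with it for the wrong reason.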

• Thank you very much for your reply!

I’ll concede that the Markov property does not make all nodes indistinguishable. I’ll go further and say that not all algorithms have to have the Markov property. A Google search taught me that an RNN breaks the Markov property. But then again, we are dealing with the worst-case game, so with our luck, it’ll probably be some highly correlated thing.

You suggest using some strong prior belief. I assume you mean a prior belief about I or about I → G? I thought, but correct me if I’m wrong, that the opaqueness of the internal state of the complex AI would mean that we can have no meaningful prior belief about the internal state. So that would rule out a prior belief about (the hyperparameters of) our reporter I → G. Or am I wrong?

We can however have a strong idea about A → G, as per example of the ‘human operator’ and use that as our training data. But that falls with the counterexample given in the report, when the distribution shifts from simple to complex.

• RNNs break the Markov property in the sense that they depend on more than just the previous element in the sequence they are modelling. But I don’t see why that would be relevant to ELK.

When I say that a strong prior is needed I mean the same thing that Paul means when he writes: “We suspect you can’t solve ELK just by getting better data—you probably need to ‘open up the black box’ and include some term in the loss that depends on the structure of your model and not merely its behaviour.”. Which is a very broad class of strategies.

I also don’t understand what you mean by having a strong idea about A->G, we of course have pairs of [A, G] in our training data but what we need to know is how to compute G from A given these pairs.

• ### Updating my first line of thought

RNNs break the Markov property in the sense that they depend on more than just the previous element in the sequence they are modelling. But I don’t see why that would be relevant to ELK.

You’re right in that RNNs don’t have anything to do with ELK, but I came back to it because the Markov property was part of the lead up to saying that all parts of I are correlated.

So with your help, I have to change my reasoning to:

In the worst case our reporter needs to learn the function between highly correlated I and our target G.

Correct? Then I can update my first statement to

In the worst case, I is highly correlated to such a point that no single part of I can be uniquely mapped to G, regardless of any ontological mismatch.

If I’m wrong, do let me know!

### Updating my second line of thought

When I say that a strong prior is needed I mean the same thing that Paul means when he writes: “We suspect you can’t solve ELK just by getting better data—you probably need to ‘open up the black box’ and include some term in the loss that depends on the structure of your model and not merely its behaviour.”. Which is a very broad class of strategies.

Ah yes, I understand now. This relates to my second line of thought. I reasoned that the reporter could learn any causal graph. I said we had no way of knowing which.

Because of your help, I need to update that to:

We have no way of knowing which causal graph was learned if we used a black box as our reporter.

Which was in the opening text all along...

But this leads me to the question:

If I cannot reason about internal state I, can I have a prior belief about I? And if I have no prior belief about I, can I have a prior belief about G as a function of I?

My analogy would be: If I don’t know where I am, how can I reason about getting home?

And -if you’ll humor me- my follow up statement would be:

If I can form no prior belief about G as a function of I and this function has to have some non-small complexity, then no option remains but a priorless black box.

Again, If I’m wrong: let me know! I’m learning a lot already.

Irrelevant side note: I saw you using the term computational graph. I chose the term causal graph, because I liked it being closer to the ground truth. Besides, a causal graph learned by some algorithm need not be exactly the same as its computational graph. And then I chose such simple examples that they were equal again. Stupid me.

• As before I am behind the curve. Above I concluded saying that I can form no prior belief about G as a function of I. I cannot, but we can learn a function to create our prior. Paul Christiano already wrote an article about learning the prior (https://​​www.lesswrong.com/​​posts/​​SL9mKhgdmDKXmxwE4/​​learning-the-prior).

So in conclusion, in the worst case no single function mapping I to G exists, as there are multiple, each reducing down to either camp translator or camp human-imitator. Without context we can form no strong prior due to the complexity of A and I, but as Paul described in his article we can learn a prior from, for example, in our case, the dataset containing G as a function of A.

I’ll add a tl;dr in my first post to shorten the read about how I slowly caught up to everyone else. Corrections are of course still welcome!

• Question: Does ARC consider ELK-unlimited to be solved, where ELK-unlimited is ELK without the competitiveness restriction (computational resource requirements comparable to the unaligned benchmark)?

One might suppose that the “have AI help humans improve our understanding” strategy is a solution to ELK-unlimited because its counterexample in the report relies on the competitiveness requirement. However, there may still be other counterexamples that were less straightforward to formulate or explain.

I’m asking for clarification of this point because I notice most of my intuitions about counterexamples aren’t drawing heavily on the competitiveness requirement, and I suspect ELK-unlimited is still open. If ARC doesn’t think so maybe this discrepancy will become a source of new counterexamples.

• My guess is that “help humans improve their understanding” doesn’t work anyway, at least not without a lot of work, but it’s less obvious and the counterexamples get weirder.

It’s less clear whether ELK is a less natural subproblem for the unlimited version of the problem. That is, if you try to rely on something like “human deliberation scaled up” to solve ELK, you probably just have to solve the whole (unlimited) problem along the way.

It seems to me like the core troubles with this point are:

• You still have finite training data, and we don’t have a scheme for collecting it. This can result in inner alignment problems (and it’s not clear those can be distinguished from other problems, e.g. you can’t avoid them with a low-stakes assumption).

• It’s not clear that HCH ever figures out all the science, no matter how much time the humans spend (and having a guarantee that you eventually figure everything out seems kind of close to ELK, where the “have AI help humans improve our understanding” is to some extent just punting to the humans+AI to figure out something).

• Even if HCH were to work well it will probably be overtaken by internal consequentialists, and I’m not sure how to address that without competitiveness. (Though you may need a weaker form of competitiveness.)

I’m generally interested in crisper counterexamples since those are a bit of a mess.

• Apologies for a possibly naive comment/​question, perhaps this has been discussed elsewhere and you can just direct me there. But anyway...

I would find it helpful to see a strategy that ARC believes does in fact solve ELK, but fails only because it requires taking an unacceptably large capabilities hit. I would find this helpful for several reasons, namely

(1) it would help me to understand what kinds of strategies you believe really do escape counter-examples,
(2) it would give me a better sense for how optimistic to be about the approach, since it’s often easier to start from an inefficient solution and make it more efficient, than it is to find an inefficient solution in the first place, and/​or
(3) if you have trouble identifying such a solution, then it would suggest to me that finding one might be a useful research direction.

• If you don’t care about a capabilities hit, I think the salient strategy is training your model to predict human predictions rather than to predict reality. You can still do science+debate+etc. in order to improve those predictions.

If you care about getting superhuman capabilities (and going beyond recursive schemes etc.) then I don’t know if there’s any easy way to “merely” pay a big capabilities hit. Certainly I don’t know how to e.g. solve the problem in a way that’s merely very computationally expensive (and that does sound like it would be major progress towards a solution, I’d guess it would mean you are most of the way there).

• Thanks, yeah I meant that I was interested in a solution that would scale to arbitrarily superhuman AI capabilities with a “mere” capabilities hit/​cost (perhaps a very large cost that grows with AI capability, but does not impose a bound on the ultimate capability of the aligned system). So this was a useful clarification for me in terms of understanding your perspective; I may be wrong but I could imagine it might be useful to lead with this a bit more, ie “we don’t know of and would be very interested in solutions that might be extremely costly but that avoid all counter-examples”. Possibly you already say this and I just missed it.

• It seems like recursive schemes can potentially scale arbitrarily far (and at least up to the analog of “NEXP”, but probably farther), they are mostly just limited by the capability of the AI assistants /​ debaters /​ etc. So it’s kind of hard to distinguish mere capabilities costs from bounds on the ultimate capability.

We could exclude that kind of thing because we have no idea what the bound is (or perhaps humans just never discover some facts that gradient descent discovers, or that they discover them in a way that causes them to run into the same problem). I think in that case the problem is still open. For example, finding a solution that definitely runs in more time than the unaligned benchmark looks hard, I’d guess it’s roughly as hard as finding a solution that definitely runs in times more time than the unaligned benchmark.

The main reasons we decided not to emphasize this, and to focus as much as we do on competitiveness issues, are (i) the “do science” options do feel like they work if you have enough compute and it seems like you need to emphasize the competitiveness issue to explain why we don’t like them (or else get into increasingly weird counterexamples), (ii) in the worst case we don’t expect a very slow solution to be much easier than a very fast solution, since most realistic kinds of slowdown can get arbitrarily bad in the worst case, and the plausible approaches we are aware of all seem pretty likely to be roughly competitive. So emphasizing it seems likely to set people off down weirder alleys (which would be good for someone to go down if lots of folks are working on the problem but probably aren’t where you should start).

• There’s a direction (which I imagine you and others have considered) where you replace some activations within your AI with natural language, so that eg heuristically certain layers can only communicate with the next layer in NL.

Then you heavily regularize in various ways. You’d require the language to be fully understandable and transparent, perhaps requiring that counter-factual changes to inputs lead to sensible changes to outputs within subsystems, etc. You’d have humans verify the language was relevant, meaningful, & concise, train AIs to do this verification at larger scale, do some adversarial training, etc. You could also train sub-human level AIs to paraphrase the language that’s used and restate it between layers, to make it really hard for the whole system to ever pass hidden coded messages.

This seems like it lives under a slogan like “enforce interpretability at any cost”. This would almost certainly incur a big efficiency/​capabilities hit. Maybe it’s enormous. Though it actually seems plausible that the hit would be much smaller for extremely capable systems, as compared to the AI models of today.

A crucial question will then be “how powerful are the subsystems that talk to each other via natural language allowed to get”, where in the most conservative limit each subsystem is human level, or even significantly below, and in the riskiest limit you just have a single NL layer that cuts the system in half.

There’s a worry along the lines of “maybe the whole system is so big and complex it has emergent bad and inscrutable behavior even though every step is interpretable and makes sense”. Or in the same vein “the answers to simple big-picture questions we care about don’t live anywhere specific, so this doesn’t help us to ensure the model can transparently address them, even if its operation itself can be broken down into transparent pieces.” That said, I think we’re in a better position wrt these issues, as we can now talk about training models that automate the extraction of big-picture information from the NL activations in this giant beast.

• I think there’s a real fork in the road between:

1. You replace parts of your neural network with natural language, optimize those parts to implement a good process, and then hope the outcome is good because the process is good.

2. You replace parts of your neural network with natural language, and then optimize that natural language to achieve good outcomes.

I think that #1 is safe and runs into significant capability limitations (roughly the same as debate/amplification). It may still be good enough to carry the day if things work out well or if people are willing to exercise a lot of restraint, and I’d like to see people doing it. I think that in this case it doesn’t matter that much how powerful the subsystems are, since each of them is doing something that you understand (though there are many subtleties and possible problems, e.g. with emergent bad behavior and some inner alignment problems).

I think that by default #2 is pretty dangerous. If you took this route I don’t think it would be fair to call the bad/​inscrutable behavior “emergent,” or to call each step “interpretable”—the steps make sense but by default it seems extremely likely that you don’t understand why the process leads to good results. (If you did, you could have just taken path #1.) If there is bad behavior it’s not emergent it’s just produced directly by gradient descent, and the fact that you can encode the intermediate activations in natural language doesn’t really address the risk (if that information isn’t necessarily functioning in the way you expect).

I feel like different versions of path #2 sit on a spectrum between “fairly safe like path #1” and “clearly unworkably dangerous.” I feel most comfortable basically starting from path #1 and then carefully adding in stuff you don’t understand (e.g. systems solving small subtasks in ways you don’t understand, or optimizing only a small number of degrees within a space you understand reasonably well).

You could instead start with “very scary” and then try to add in controls (like paraphrasing, tighter input bottlenecks, smaller pieces, etc.) to make it safe, but I find that approach pretty scary.

• Maybe I’m being stupid here. On page 42 of the write-up, it says:

In order to ensure we learned the **human simulator**, we would need to change the training strategy to ensure that it contains sufficiently challenging inference problems, and that doing direct translation was a cost-effective way to improve speed (i.e. that there aren’t other changes to the human simulator that would save even more time). [emphasis mine]

Shouldn’t that be?

In order to ensure we learned the direct translator, …

• Turning this into the typo thread, on page 97 you have

In Section: specificity we suggested penalizing reporters if they are consistent with many different **reporters**

Pretty sure the bolded word should be predictors.

• Yes, thanks!

• I’m extremely flattered at the award; I’ve been on LessWrong for like a month, and definitely did not expect this. I can confirm to you guys that this makes me want to try harder at ELK, so your incentive is working!

I want to rebut your arguments in “Strategy: Predict hypothetical sensors” in your Counterexamples to some ELK proposals post. I’m reproducing it in full here for convenience.

## Strategy: Predict hypothetical sensors

(Proposal #2 here, also suggested with counterexample by Rohin in private communication)

Instead of installing a single sensor, I could randomly choose what sensors to deploy and how to install them. Then I could train a model Predict+ to map (description of sensor) → (sensor reading), and I could give a human the readings from many hypothetical sensors in order to give them the best chance of detecting tampering.

### Worry 1: Predict+ may not be reasonable on non-existent sensors

To collect training data for Predict+, I actually install sensors and then use their readings as a target. In every situation, my AI is being asked to predict the readings from a sensor that I really did build.

If my AI is able to deduce the presence of sensors (as described in the last counterexample), then it’s unclear how Predict+(sensor description) will behave when we give it a description of a sensor that does not in fact exist.

In particular, Predict+(sensor description) might behave in the following undesirable way: it predicts the future observations conditioned on its observations and on the fact that someone typed (sensor description) into the dataset. But it knows that no sensor with description (sensor description) exists. So if someone typed that description into the dataset, that just means something funny is going on—not that the dataset will magically contain readings from a non-existent sensor.

For a predictor which understands the world well it seems quite plausible that this is the most natural behavior. Otherwise it needs to reason about a physically impossible counterfactual to get the “intended” answer (which it never needed to do to answer any other questions). Moreover, it would also be undesirable if the predictor conditioned on such a sensor existing despite having strong evidence to the contrary (since its predictions may then be extremely strange).

(A more extreme version of the same worry applies to proposal #3 here. A predictor seems even more likely to behave badly when it is given a description of a sensor that the human obviously couldn’t have built, and it would definitely know that the sensor doesn’t actually exist.)

It’s true that, if we were to simply describe magic-based sensors, we would likely get weird off-distribution behavior. But to get more concrete, we could have a separate AI system provide us with step-by-step instructions for how to build a sensor, and then input the execution of these instructions as part of the world state.

To answer the “how does this sensor-builder know how to build sensors that sense the thing it’s claiming” question, I’d posit that this problem is much simpler than full ELK (we can test a given sensor design offline in many specific scenarios).

To answer the “how do you choose which sensors to tell sensor-builder to build” question, I’d posit that the space of possible sensing modalities is actually finite, if large, and I’d wager we can beforehand use AI to help us figure out reasonable coverage of all possible sensors that measure quantities of interest. Why do I think the space is finite? Maybe I’m thinking too narrowly, but it seems like we know most of the dimensions of variation of measurable reality-affecting actions: detected signal type (passive or active in response to output, like echolocation), signal amplitude (small-scale or large-scale detection), wavelength, detector density (close-by like photoreceptors in the eye or far apart like touch cells, but also density along other dimensions like wavelength), relative orientation, and perhaps a few others.

One possible breaker response is, what about ontology mismatches that occur because the “scary” fact is an enormous conjunction of different facts? This seems addressable via impulse-sensors, i.e. “sensors” that combine an intervention in the system with an observed response. An example of an impulse is “an innocent bystander walks into the diamond room”, and the attached sensor is a video feed of what happens to them (and whether they get vaporized by nanomachines). Or, for a more out-there example, “an omnipotent previously-undetectable alien AI, that had been running at a layer beneath all human semiconductors, activates at timestep 592, scoops up a handful of nanomachines, deposits them on the electron microscope, and then deactivates forever”, followed by the microscope readings. This part is half-baked, and if you can provide a strong Breaker response to this scenario, it’d be helpful for me to noodle on.

I’d expect another possible breaker response here is “how does a human understand the possibly-unconnected-if-small array of sensor outputs well enough to actually understand what they imply together?” In my original proposal, the “terrifier” was an AI that searched over the predictor’s Bayes net for where to deploy sensors such that their readings would maximize the probability the human would reject the proposal. We can add to that the terrifier’s ability to provide a short (<5 page) description of how to interpret the sensors in conjunction, short enough that a committee of humans could plausibly check its logic and make sure it wasn’t just scaring the humans without cause. For a decent set of sensors and impulses, this covers any scenario I can quickly think of, including “everyone has been copied using nanotech” (which seems like it should be detectable at the nanometer range somewhere along the line).

If the breaker wants to claim that there is bad stuff that no small combination of futuristic sensors could detect at any scale, I’d need to hear a specific example.

### Worry 2: the human may not be able to tell what is going on even with all the sensors

For some kinds of tampering no possible sensors would reveal what had happened. For example, we can imagine that our AI has covertly built an army of tiny machines which deconstruct the objects in the room while replacing them with fake versions that look the same on the outside. Sufficiently sophisticated machinery could reproduce the results of almost any experiment we perform (e.g. if you broke an egg, the tiny machines could simulate the sound of an egg cracking and quickly create the yolk on demand).

More realistically, our AI could search for sensors and tamper with any that it finds. It’s not clear how it is supposed to answer questions about what a sensor “would have shown”—if the hypothetical sensor had actually existed, then it also would have been found and tampered with. In this scenario, it may be “correct” for Predict+ to report the tampered value.

(As usual, this is only a problem if the human overseer literally can’t notice anything suspicious about a sequence of actions which is in fact constructing an army of sophisticated tiny machines. We are putting the burden of proof on the builder to show that a given concern is impossible.)

I can’t think of a scenario where an army of tiny tampering machines isn’t detectable to a tiny sensor somewhere before they’re all active and have already eaten the sensor, for a sufficient set of possible sensors as previously described. Maybe your mental model of the predictor doesn’t include detailed intermediate states? But it seems like we would need to populate those intermediate states in order to simulate sensor outputs anyway.

Based on any comment-tree, I’ll decide whether I believe in this enough to write it up as a formal submission.

• Can you explain this: “In Section: specificity we suggested penalizing reporters if they are consistent with many different reporters, which effectively allows us to use consistency to compress the predictor given the reporter.” What does it mean to “use consistency to compress the predictor given the reporter” and how does this connect to penalizing reporters if they are consistent with many different predictors?

• Warning: this is not a part of the report I’m confident I understand all that well; I’m trying anyway and Paul/​Mark can correct me if I messed something up here.

I think the idea here is like:

• We assume there’s some actual true correspondence between the AI Bayes net and the human Bayes net (because they’re describing the same underlying reality that has diamonds and chairs and tables in it).

• That means that if we have one of the Bayes nets, and the true correspondence, we should be able to use that to rederive the other Bayes net. In particular the human Bayes net plus the true correspondence should let us reconstruct the AI Bayes net; false correspondences that just do inference from observations in the human Bayes net wouldn’t allow us to do this since they throw away all the intermediate info derived by the AI Bayes net.

• If you assume that the human Bayes net plus the true correspondence are simpler than the AI Bayes net, then this “compresses” the AI Bayes net because you just wrote down a program that’s smaller than the AI Bayes net which “unfolds” into the AI Bayes net.

• This is why the counterexample in that section focuses on the case where the AI Bayes net was already so simple to describe that there was nothing left to compress, and the human Bayes net + true correspondence had to be larger.

• A different way of phrasing Ajeya’s response, which I think is roughly accurate, is that if you have a reporter that gives consistent answers to questions, you’ve learned a fact about the predictor, namely “the predictor was such that when it was paired with this reporter it gave consistent answers to questions.” If there were 8 predictors for which this fact was true, then “it’s the [7th] predictor such that when it was paired with this reporter it gave consistent answers to questions” is enough information to uniquely determine the predictor, i.e. the previous fact + 3 additional bits was enough. If the predictor was 1000 bits, the fact that it was consistent with a reporter “saved” you 997 bits, compressing the predictor into 3 bits.

The hope is that maybe the honest reporter “depends” on larger parts of the predictor’s reasoning, so fewer predictors are consistent with it, so the fact that a predictor is consistent with the honest reporter allows you to compress the predictor more. As such, searching for reporters that most compressed the predictor would prefer the honest reporter. However, the best way for a reporter to compress a predictor is to simply memorize the entire thing, so if the predictor is simple enough and the gap between the complexity of the human-imitator and the direct translator is large enough, then the human-imitator+memorized predictor is the simplest thing that maximally compresses the predictor.
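The bit-counting in the two paragraphs above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name `bits_saved` and the idea of indexing the consistent predictors with a fixed-width integer are assumptions made for illustration, not something taken from the report.

```python
import math

def bits_saved(predictor_bits: int, num_consistent: int) -> int:
    """Bits saved by describing a predictor as 'the k-th predictor that is
    consistent with this reporter' instead of writing it out in full."""
    # Bits needed to pick one predictor out of the consistent set.
    index_bits = math.ceil(math.log2(num_consistent)) if num_consistent > 1 else 0
    return predictor_bits - index_bits

print(bits_saved(1000, 8))  # 997, the example from the comment
print(bits_saved(1000, 2))  # 999: fewer consistent predictors, more compression
```

The second call shows the hoped-for effect: a reporter that is consistent with fewer predictors compresses the predictor more.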

• I was notified I didn’t win a prize so figured I’d discuss what I proposed here in case it sparks any other ideas. The short version is I proposed adding on a new head that would be an intentional human simulator. During training it would be penalized for telling the truth that the diamond was gone when there existed a lie that the humans would have believed instead. The result would hopefully be a head that acted like a human simulator. Then the actual reporter would be trained so that it would be penalized for using a similar amount of compute as the intentional human simulator, or looking at similar nodes or node regions as the intentional human simulator. The hope is that by penalizing the reporter for acting like the intentional human simulator, it would be more likely to do direct translation instead of human simulation.

This does have at least one counterexample that I proposed as well, which is that the reporter could simply waste compute doing nothing to avoid matching the intentional human simulator, and could look at additional random nodes it doesn’t care about to avoid looking like it was looking at the same nodes as the intentional human simulator. Though I thought there was some possibility that having to do these things might end up incentivizing the reporter to act like a direct translator instead of a human simulator.

Although I’m not sure why this wasn’t very promising, my guess is that the counterexample is too obvious and that my proposal doesn’t gain much ground in keeping the reporter from acting like a human simulator, or someone else has already thought of this approach, or perhaps my counterexample is too similar to the counterexample to “penalize reporters that work with many different predictors” where the reporter could just pretend to not work with other predictors (it’s similar in that the reporter could pretend not to look like the intentional human simulator).

Here’s my full submission in google docs with more description: https://docs.google.com/document/d/1Xa4CDLNJ-VPT7hqEUIHlqCsXVeFCYDB5h7Vn3QJ_qpA/edit?usp=sharing

• Thanks for sharing your idea!

• The official deadline for submissions is “before I check my email on the 16th”, which I tend to do around 10 am PST.

• I was talking about ELK in a group, and the working example of the SmartVault and the robber ended up being a point of confusion for us. Intuitively, it seems like the robber is an external, adversarial agent who tries to get around the SmartVault. However, what we probably care about in practice would be how a human could be fooled by an AI, not by some other adversary. Furthermore, it seems that whether the robber decides to cover up his theft of the diamond by putting up a screen depends solely on the actions of the AI. Does this imply that the robber is “in cahoots” with the AI in this situation (i.e. the AI projects a video onto the wall instructing the robber to put up a screen)? This seems a bit strange and complicated.

Instead, we might consider the situation in which the AI controls a SmartFabricator, which we want to arrange carbon atoms into diamonds. We might then imagine that it instead fabricates a screen to put in front of the camera, or makes a fake diamond. This wouldn’t require the existence of an external “robber” agent. Does the SmartVault scenario have helpful aspects that the SmartFabricator example lacks?

• The SmartFabricator seems basically the same. In the robber example, you might imagine the SmartVault is the one that puts up the screen to conceal the fact that it let the diamond get stolen.

• I suppose there are a number of examples that work, but I think the robber and vault give the scenario useful breadth.

The following is just my interpretation of it, so take it with a grain of salt. To me the robber and vault enable a few options. The AI can be passively lying or actively concealing. If the robber comes in, gets past the AI’s defenses, and takes the diamond in a way the human observer can’t notice, then the AI has the option of passively lying. The AI tried its best to stop the robber and failed, but then chose to lie about it so it still got the reward of having protected the diamond as far as the humans know.

Alternatively the AI could actively conceal the outcome. The AI could try its best and fail to stop the robber, and then do some trickery to make it look like it did actually stop the robber. Or the AI could not bother stopping the robber and just focus on making it look like the diamond is still there. Here the AI is playing a more active role in concealing the outcome.

None of these scenarios require coordination from the robber. To me, the robber is just there to rob a sophisticated vault and make it look like they were never there. So the robber might cover up cameras or do other tampering so it looks like they were never there.

I think this is more flexible than your fabricator example. There the AI can’t really play a passive role, it’s either concealing or not. But you could probably demonstrate the things ARC is looking at here with the fabricator example too I would think.

Like I said, just my interpretation, so I may be misunderstanding the intent or other nuances.

• Are there any additional articles exploring the strategy of penalizing inconsistencies across different inputs? It seems both really promising to me, and like something that should be trivially breakable. I’d like to get a more detailed understanding of it.

• I haven’t written any such articles. I definitely think it’s promising.

• I see. I have a specific counterexample that feels like it had to have been considered already, but I haven’t seen it mentioned...

The strategies such as penalizing inconsistencies seem to rely on our ability to isolate the AI within the context of training, or make it somehow “buy” into it — as opposed to quickly realizing what’s happening and worming its way out of the proverbial box. It feels particularly strange to me when we’re talking about AIs that can think better than the smartest human or handily beat specialized science AIs at the useful-ontology game.

Specific example: Once the AI figures out that it’s being checked for consistency in parallel with other instances of itself, it might sidestep the “consistency game” entirely and just tailor its outputs such that they leave hints for its other instances in the environment, breaking the no-communication condition. It seems in the spirit of worst-case scenarios that we have to assume it’d be able to do that, no matter how we sandbox and airgap it.

(On the other hand, if we assume that the AI is not smart enough to do that, and would instead have to learn a direct translator, we probably should assume the same for e.g. the strategy with human operators tricking human observers, which probably caps the AI at below the level of the smartest possible human and makes that class of strategies more workable.)

This applies more broadly as well: especially to other strategies that might inadvertently give the AI a specific incentive to break out, and prospectively to all training strategies that rely on the training still working after the AI achieves superintelligence (as opposed to assuming that the training would stop being effective at that point and hoping that the pre-superhuman training would generalize).

Broadly, any proposal that relies on the AI still being fed training examples after it achieves superintelligence has to somehow involve forcing/​teaching it not to think its way out of the box.

Edit: To elaborate on the thought...

I understand that some of the above is covered by the stipulation not to worry about cases where the AI becomes a learned optimizer, but:

1. Empirically, even relatively simple and old algorithms sometimes learn to do this sort of thing, see here (“Creative Program Repair”, pages 7-8). Regardless of whether it’s a “learned optimizer” or not, if this sort of behaviour can show up this easily, we should definitely be ready to handle it when it comes to training ASI.

2. I don’t think learned optimization is necessary for that; assuming it’s necessary seems to be committing an ontology mismatch error. We’re not “really” training a neural network using gradient descent, fundamentally speaking. We’re permuting a certain bundle of ones and zeroes so that, when that bundle interacts with another bundle of ones and zeroes, part of it assumes certain desirable configurations. It seems entirely plausible that a superintelligent non-optimizer would arrive at some ontology that mixes such low-level considerations directly with high-level ones, and learns to intervene on the machine code running it to produce outputs that minimize its loss function, like water flowing downhill. All without doing things like “realizing that it’s in a simulation” or exhibiting agenty behaviour where it’s looking through the possibility space in search of clever plans based on which it’ll design outputs.

• I think this is an important family of counterexamples not really addressed in our report (and which we’re not really asking contest participants to handle, i.e. if you come up with a proposal for which this is the only counterexample we could come up with then we’d definitely give a $50k prize).

Some thoughts:

• As you say, a lot of this comes down to our hopes for handling learned optimizers, which we discuss in this appendix.

• Most importantly, we can use SGD to optimize “What is the probability I assign to answers that are consistent with other copies’ answers?” instead of “What score do I get in the consistency game?”. If this was the only difficulty, we’d want to be really careful about what SGD optimized. That means that we are basically left with inner alignment problems that could kill us in much more direct ways (e.g. by making the low-stakes assumption that we can recover from any finite number of failures). What I’m mostly interested in is trying to get enough of a handle on ELK that inner alignment becomes the dominant concern /​ we know enough about ELK to be able to work productively on inner alignment.

• If your model is doing optimization in order to break out of the sandbox, searching over cognitive actions to find those that break out of the sandbox, what we really want to do is apply ELK at an earlier stage to the model that was evaluating cognitive actions. This is a pretty messy situation, but it looks promising enough that I’m inclined to focus on other problems first and then see how this counterexample looks later. (Similarly, you’ll always have similar pathologies with gradient hacking etc. and so you need to use a messier mechanism like this to handle those in the worst case.)

• On the other hand, if your model isn’t doing any optimization to find ways to break out of the sandbox, then it seems like you are up against a fairly weak adversary (just SGD). In the worst case this is still bad news, but it’s not merely better in practice, it also seems like it gives you a large leg up with which to beat the model. In general, defending against a SmartVault which controls a ton of actuators seems harder than defending against an adversary who can only use side channels.

• I’m normally pretty opposed to sandboxing, but I do also think that if this was all you had to do, then it is actually plausible that you could literally win the game. E.g. if the overhead on homomorphic encryption was low enough (say 1000%) that you could afford to run each reporter training datapoint homomorphically encrypted, then it seems like you could win against any adversary. I still wouldn’t bet on it without thinking a lot, but I think it makes the point that there is a big quantitative gap here.

Overall I agree that in the worst case this kind of counterexample applies to all of the proposals in this document. Moreover, I agree that unfortunately it is (i) a complicated aesthetic judgment about what kind of counterexample to focus on first / what is a prerequisite to what, and (ii) not obvious that you can define the problem narrowly enough to avoid this counterexample without also defining-away other difficulties we are trying to address. But I’m still willing to bet on the aesthetic judgment and it does look like we can avoid this counterexample without accidentally defining away the whole problem.

(I don’t think this comment covers the entire issue, I hope we’ll write more about this in another report.)

• Am I right in thinking:

1) that the problem can be stated as: the AI has latent knowledge of lots of variables, like the status of the cameras, doors, alarm system, etc and also whether the diamond is in the vault; but you can’t directly ask it whether the diamond is in the vault, because its training has taught it to answer “would a human observer think the diamond is in the vault?” instead (because there was no way at training time to give it feedback on whether it correctly predicted the diamond was in the vault, only feedback on whether it correctly predicted a human thought the diamond was in the vault)?

2) that you do have access to z, the large “vector of floats representing the generative model’s latent space”, but that you have no idea which part(s) of it represents the AI’s knowledge about whether the diamond is in the room?

• Yes, that’s right. The key thing I’d add to 1) is that ARC believes most kinds of data augmentation (giving the human AI assistance, having the human think longer, giving them other kinds of advantages) are also unlikely to work, so you’d need to do something to “crack open the black box” and penalize ways the reporter is computing its answer. They could still be surprised by data augmentation techniques but they’d hold them to a higher standard.

• Ask dumb questions! … we encourage people to ask clarifying questions in the comments of this post (no matter how “dumb” they are)

ok… disclaimer: I know little about ML and I didn’t read all of the report.

All of our counterexamples are based on an ontology mismatch between two different Bayes nets, one used by an ML prediction model (“the predictor”) and one used by a human.

I am confused. Perhaps the above sentence is true in some tautological sense I’m missing. But in the sections of the report listing training strategies and corresponding counterexamples, I wouldn’t describe most counterexamples as based on ontology mismatch. And the above sentence seems in tension with this from the report:

We very tentatively think of ELK as having two key difficulties: ontology identification and learned optimization. … We don’t think these two difficulties can be very precisely distinguished — they are more like genres of counterexamples

So: do some of your training strategies work perfectly in the nice-ontology case, where the model has a concept of “the diamond is in the room”? If so, I missed this in the report and this feels like quite a strong result to me; if not, there are counterexamples based on things other than ontology mismatch.

• I am confused. Perhaps the above sentence is true in some tautological sense I’m missing. But in the sections of the report listing training strategies and corresponding counterexamples, I wouldn’t describe most counterexamples as based on ontology mismatch.

In the report, the first volley of examples and counterexamples are not focused solely on ontology mismatch, but everything after the relevant section is.

So: do some of your training strategies work perfectly in the nice-ontology case, where the model has a concept of “the diamond is in the room”?

ARC is always considering the case where the model does “know” the right answer to whether the diamond is in the room in the sense that it is discussed in the self-contained problem statement appendix here.

The ontology mismatch problem is not referring to the case where the AI “just doesn’t have” some concept—we’re always assuming there’s some “actually correct /​ true” translation between the way the AI thinks about the world and the way the human thinks about the world which is sufficient to answer straightforward questions about the physical world like “whether the diamond is in the room,” and is pretty easy for the AI to find.

For example, if the AI discovered some new physics and thinks in terms of hyper-strings in a four-dimensional manifold, there is some “true” translation between that and normal objects like “tables /​ chairs /​ apples” because the four-dimensional hyper-strings are describing a universe that contains tables /​ chairs /​ apples; furthermore, an AI smart enough to derive that complicated physics could pretty easily do that translation—if given the right incentive—just as human quantum physicists can translate between the quantum view of the world and the Newtonian view of the world or the folk physics view of the world.

The worry explored in this report is not that the AI won’t know how to do the translation; it’s instead a question of what our loss functions incentivize. Even if it wouldn’t be “that hard” to translate in some absolute sense, with the most obvious loss functions we can come up with it might be simpler /​ more natural /​ lower-loss to simply do inference in the human Bayes net.

• I don’t understand your counterexample in the appendix Details for penalizing inconsistencies across different inputs. You present a cheating strategy that requires the reporter to run and interpret the predictor a bunch of times, which seems plausibly slower than doing honest translation. And then you say you fix this issue with:

But this dependence could be avoided if there was an intermediate model between the predictor’s Bayes net (which we are assuming is very large) and the human’s Bayes net. Errors identified by the intermediate model are likely to be “real” errors in the human’s model. But we can identify those errors with an amount of computation that depends only on the intermediate model, rather than needing to actually run the full predictor. For example, if the human’s Bayes net has size N, the intermediate model has size 2N, and the predictor’s Bayes net has size 1e12 x N, then the cost of using the intermediate model many times can still be small relative to the cost of direct translation.

Roughly speaking, I don’t see how the reporter with an intermediate model systematically wins out over the translator given that the intermediate model will miss many things that the predictor understands.

Taking it more slowly, using the example from the report:

Let’s say we have X1=‘do rowhammer’ (which the intermediate model does understand) and X2=‘do eldritch hacks that the intermediate model isn’t good enough to understand’.

Let’s say we have reporters:

• s, which does naive human simulation,

• i, which has an intermediate model which it always uses to look for and fix strange correlations,

• t, a direct translator.

Let’s say the reporter is asked “Is the diamond safe?” on an input that includes X1, and the human runs consistency checks across many actions containing X1 (and none containing X2). Here, it seems like i and t both pass the consistency checks, and i is faster than t, so i wins.

But now let’s say that the reporter is asked “Is the diamond safe?” on an input that includes X2, and the human runs consistency checks across many actions containing X2. In this case, the reporter i will fail the consistency checks (since the intermediate model’s predictions won’t match the predictor’s predictions, and so the reporter will fail to adjust for the revealing correlations), so t will come out ahead.

So if these reporters are the only competitors, it seems like we should be able to tune the regularization to make t win.
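To make the “tune the regularization” claim concrete, here is a toy scoring rule in Python. Everything in it is an assumption for illustration: the compute costs, the consistency-failure rates, the weight `lam`, and the additive form `cost + lam * expected_failures` are made up and do not come from the report.

```python
# Hypothetical reporters: (relative compute cost,
#                          consistency-failure rate on X1-type inputs,
#                          consistency-failure rate on X2-type inputs).
REPORTERS = {
    "s": (1.0, 1.0, 1.0),   # naive human simulator: cheap, always caught
    "i": (2.0, 0.0, 1.0),   # intermediate model: handles X1, fails on X2
    "t": (10.0, 0.0, 0.0),  # direct translator: expensive, always consistent
}

def total_loss(name: str, lam: float, frac_x2: float) -> float:
    """Compute cost plus lam times the expected consistency-failure rate."""
    cost, fail_x1, fail_x2 = REPORTERS[name]
    expected_failures = (1 - frac_x2) * fail_x1 + frac_x2 * fail_x2
    return cost + lam * expected_failures

def winner(lam: float, frac_x2: float) -> str:
    """Reporter with the lowest total loss under this toy rule."""
    return min(REPORTERS, key=lambda name: total_loss(name, lam, frac_x2))

print(winner(10, 0.1))   # i: a small consistency weight favors the cheap model
print(winner(100, 0.1))  # t: a large enough weight favors the translator
print(winner(100, 0.0))  # i: if X2-type inputs never occur, i is never caught
```

In this toy, how large `lam` has to be depends entirely on how often i fails the checks; the smaller i's failure rate on X2-type inputs, the larger the weight needed before t wins.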

• I agree that i does slightly worse than t on consistency checks, but i also does better on other regularizers you’re (maybe implicitly) using like speed/​simplicity, so as long as i doesn’t do too much worse it’ll still beat out the direct translator.

One possible thing you might try is some sort of lexicographic ordering of regularization losses. I think this rapidly runs into other issues with consistency checks, like the fact that the human is going to be systematically wrong about some correlations, so i potentially is more consistent than t.

• I agree that i does slightly worse than t on consistency-checks, but i also does better on other regularizers you’re (maybe implicitly) using like speed/​simplicity, so as long as i doesn’t do too much worse it’ll still beat out the direct translator.

Any articulable reason for why i just does slightly worse than t? Why would a 2N-node model fix a large majority of discrepancies between an N-node model and a 1e12*N-node model? I’d expect it to just fix a small fraction of them.

I think this rapidly runs into other issues with consistency checks, like the fact that the human is going to be systematically wrong about some correlations, so i potentially is more consistent than t.

Yeah, if you can get better-looking consistency than the direct translator in some cases, I agree that a sufficiently high consistency penalty will just push towards exploiting that (even if the intermediate model needs to be almost as large as the full predictor to exploit it properly).

I’m curious whether you think this is the main obstacle. If we had a version of the correlation-consistency approach that always gave the direct translator minimal expected consistency loss, do we as-of-yet lack a counterexample for it?

• The high-level reason is that the 1e12N model is not that much better at prediction than the 2N model. You can correct for most of the correlation even with only a vague guess at how different the AI and human probabilities are, and most AI and human probabilities aren’t going to be that different in a way that produces a correlation the human finds suspicious. I think that the largest correlations are going to be produced by the places the AI and the human have the biggest differences in probabilities, which are likely also going to be the places where the 2N model has the biggest differences in probabilities, so they should be not that hard to correct.

I’m curious whether you think this is the main obstacle. If we had a version of the correlation-consistency approach that always gave the direct translator minimal expected consistency loss, do we as-of-yet lack a counterexample for it?

I think it wouldn’t be clear that extending the counterexample would be possible, although I suspect it would be. It might require exhibiting more concrete details about how the consistency check would be defeated, which would be interesting. In some sense, maintaining consistency across many inputs is something that you expect to be pretty hard for the human simulator to do because it doesn’t know what set of inputs it’s being checked for. I would be excited about a consistency check that gave the direct translator minimal expected consistency loss. Note that I would also be interested in basically any concrete proposal for a consistency check that seemed like it was actually workable.

• Hypothesis: Maybe you’re actually not considering a reporter i that always uses an intermediate model, but instead a reporter i’ that does translations on hard questions, and just uses the intermediate model on questions where it’s confident that the intermediate model understands everything relevant. I see three different possible issues with that idea:

1. To do this, i’ needs an efficient way (i.e. one that doesn’t scale with the size of the predictor) to (on at least some inputs) be highly confident that the intermediate model understands everything relevant about the situation. I think this is a reasonable “worst-case” assumption, but I’m not sure. If you’re using it, I’d be curious to know.

2. Even when the reporter gets inputs that the intermediate model fully understands, it seems like the reporter will run into issues if its sampling-process (for generating distributions) runs into some inputs that the intermediate model doesn’t fully understand. (I.e., if i’ gets an input for which X1 is true, and then simulates many more random inputs for which X1 is true, and by chance X2 is true for one of them, then the reporter will have to do translation on that sample). Which makes it seem unlikely that i’ wouldn’t have to do translation at least once per input.

3. If the model is confident that the intermediate model understands everything relevant about the situation, it seems more efficient to return the intermediate model’s answer to the question at hand than to loop over it many times, trying to fix correlations. So really we should get a reporter i″ that does translation on the predictor on hard questions and returns an intermediate model’s latent knowledge on easy questions. That seems like an ok reporter to get.

• Stupid proposal: Train the reporter not to deceive us.

We train it with a weak evaluator H_1 who’s easy to fool. If it learns an H_1 simulator instead of direct reporter, then we punish it severely and repeat with a slightly stronger H_2. Human level is H_100.

It’s good at generalizing, so wouldn’t it learn to never ever deceive?

• This proposal has some resemblance to turning reflection up to 11. In worst-case land, the counterexample would be a reporter that answers questions by doing inference in whatever Bayes net corresponds to “the world-understanding that the smartest/​most knowledgeable human in the world” has; this understanding could still be missing things that the prediction model knows.

• How would it learn that Bayes net, though, if it has only been trained so far on H_1, …, H_10? Those are evaluators we’ve designed to be much weaker than human.

• The question here is just how it would generalize given that it was trained on H_1, H_2,...H_10. To make arguments about how it would generalize, we ask ourselves what internal procedure it might have actually learned to implement.

Your proposal is that it might learn the procedure “just be honest” because that would perform perfectly on this training distribution. You contrast this against the procedure “just answer however the evaluator you’ve seen most recently would answer,” which would get a bad loss because it would be penalized by the stronger evaluators in the sequence. Is that right?

If so, then I’m arguing that it may instead learn the procedure “answer the way an H_100 evaluator would answer.” That is, once it has a few experiences of the evaluation level being ratcheted up, it might think to itself “I know where this is going, so let’s just jump straight to the best evaluation the humans will be able to muster in the training distribution and then imitate how that evaluation procedure would answer.” This would also get perfect loss on the training distribution, because we can’t produce data points beyond H_100. And then that thing might still be missing knowledge that the AI has.

To be clear, it’s possible that in practice this kind of procedure would cause it to generalize honestly (though I’m somewhat skeptical). But we’re in worst-case land, so “jump straight to answering the way a human would” is a valid counterexample to the proposal.

This comment on another proposal gives a more precise description.
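The counterexample above can be made concrete with a toy sketch. Everything in it is an illustrative assumption rather than part of the proposal: each case carries a hypothetical `difficulty` (the weakest evaluator level that sees through it), and an evaluator that is too weak simply reports the misleading `appearance`.

```python
from dataclasses import dataclass

MAX_HUMAN_LEVEL = 100  # H_100 is human level in the proposal


@dataclass(frozen=True)
class Case:
    truth: bool       # is the diamond really there?
    appearance: bool  # what a fooled observer would conclude
    difficulty: int   # hypothetical: weakest evaluator level that is not fooled


def evaluator_answer(level: int, case: Case) -> bool:
    """H_level answers correctly only if it is strong enough for this case."""
    return case.truth if level >= case.difficulty else case.appearance


def honest_reporter(case: Case) -> bool:
    return case.truth


def h1_imitator(case: Case) -> bool:
    return evaluator_answer(1, case)


def h100_imitator(case: Case) -> bool:
    return evaluator_answer(MAX_HUMAN_LEVEL, case)


def training_loss(reporter, cases, levels) -> int:
    """Count disagreements with H_level on the cases H_level can verify."""
    return sum(
        reporter(c) != evaluator_answer(k, c)
        for k in levels
        for c in cases
        if c.difficulty <= k
    )


train_cases = [Case(truth=False, appearance=True, difficulty=d) for d in (1, 5, 10, 50, 100)]
beyond_human = Case(truth=False, appearance=True, difficulty=101)
levels = range(1, MAX_HUMAN_LEVEL + 1)

print(training_loss(h1_imitator, train_cases, levels) > 0)            # weak imitator is penalized
print(training_loss(honest_reporter, train_cases, levels))            # honest reporter: no loss
print(training_loss(h100_imitator, train_cases, levels))              # H_100 imitator: no loss either
print(honest_reporter(beyond_human), h100_imitator(beyond_human))     # they differ off-distribution
```

In this toy setting the ratchet does rule out the H_1 imitator, but it cannot separate the honest reporter from the H_100 imitator, because no training point lies beyond level 100.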

• Your proposal is that it might learn the procedure “just be honest” because that would perform perfectly on this training distribution. You contrast this against the procedure “just answer however the evaluator you’ve seen most recently would answer,” which would get a bad loss because it would be penalized by the stronger evaluators in the sequence. Is that right?

That’s almost right, but it’s being penalized right away, before it has any experience with the strong evaluators, so it can’t simulate them.

The ELK paper says we can assume, if we want, that there are no mislabeled training points (I’ll call this “assumption A”). My proposal is that it could actually be useful to mislabel some training points, because they help us detect deception.

As a simple example, let’s train a reporter to answer the single question “Is the diamond in the room?”. Each training point has two labels:

x = 1 if H_1 thinks the diamond is still there, else 0

x′ = 1 if H_100 thinks the diamond is still there, else 0.

By assumption A, our training data is such that x′ is always correct. But we deliberately choose a dataset where say 10% of the x labels are wrong (x ≠ x′).

Then we train the model on points of the form

(video, action, label x).

Crucially, the model does not see x′. The model seeks to output y that maximizes the reward R, where

R > 0 if x is right and y = x (good job)

R > 0 if x is wrong and y = x′ (you rock, thanks for correcting us!)

R < 0 if x is right and y ≠ x (bad model, never ever deceive us)

R < 0 if x is wrong and y = x (bad model, never ever deceive us)

To your point, sure, an H_100 simulator will get perfect reward, but the model doesn’t see x′, so how would it acquire the ability to simulate H_100?

EDIT: One way it could plausibly simulate H_100 is to notice that all the training examples are easy, and infer what kind of reasoning was used to generate them. We could try to block this by including some hard examples in the training, but then some of the labels will be wrong. If we only penalize it for deception on the examples where we’re sure the label is right, then it can still infer something about H_100 from our failure to penalize (“Hmm, I got away with it that time!”). A fix could be to add noise: sometimes we don’t penalize even when we know it deceived us, and perhaps (very rarely) we penalize it in case 2 (we know it corrected us honestly, but pretend we think it deceived us instead).

The irony of deceiving it about us, in order to teach it not to deceive us… !
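A minimal sketch of the four-case reward may help in comparing policies under it. The reward magnitudes below are hypothetical placeholders (only their signs follow the description above), and passing the trusted label into each policy is a modelling shortcut: a real simulator would have to compute that label itself.

```python
from itertools import product

# Hypothetical magnitudes; only the signs are taken from the description.
GOOD, CORRECTION_BONUS, DECEPTION_PENALTY = 1.0, 2.0, -100.0


def reward(x: int, x_prime: int, y: int) -> float:
    """Reward for answer y, given the weak label x and the trusted label x_prime."""
    x_is_right = x == x_prime
    y_matches_x = y == x
    if x_is_right:
        return GOOD if y_matches_x else DECEPTION_PENALTY
    return DECEPTION_PENALTY if y_matches_x else CORRECTION_BONUS


def weak_label_simulator(x: int, x_prime: int) -> int:
    return x  # copies the label it was shown


def strong_evaluator_simulator(x: int, x_prime: int) -> int:
    return x_prime  # answers the way the trusted evaluator would


def total_reward(policy) -> float:
    """Sum the reward over all four (x, x_prime) combinations."""
    return sum(reward(x, xp, policy(x, xp)) for x, xp in product((0, 1), repeat=2))


print(total_reward(weak_label_simulator))        # penalized whenever x is wrong
print(total_reward(strong_evaluator_simulator))  # never penalized
```

With binary labels, a policy that always outputs the trusted label lands in one of the two rewarded cases every time, so this reward alone does not distinguish it from an honest reporter.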

• To your point, sure, an H100 simulator will get perfect reward, but the model doesn’t see x′, so how would it acquire the ability to simulate H100?

In the worst-case game we’re playing, I can simply say “the reporter we get happens to have this ability because that happens to be easier for SGD to find than the direct translation ability.”

When living in worst-case land, I often imagine random search across programs rather than SGD. Imagine we were plucking reporters at random from a giant barrel of possible reporters, rejecting any reporter which didn’t perform perfectly in whatever training process we set up and keeping the first one that performs perfectly. In that case, if we happened to pluck out a reporter which answered questions by simulating H100, then we’d be screwed because that reporter would perform perfectly in the training process you described.

SGD is not the same as plucking programs out of the air randomly, but when we’re playing the worst case game it’s on the builder to provide a compelling argument that SGD will definitely not find this particular type of program.

You’re pointing at an intuition (“the model is never shown x-prime”) but that’s not a sufficiently tight argument in the worst-case context—models (especially powerful/​intelligent ones) often generalize to understanding many things they weren’t explicitly shown in their training dataset. In fact, we don’t show the model exactly how to do direct translation between the nodes in its Bayes net and the nodes in our Bayes net (because we can’t even expose those nodes), so we are relying on the direct translator to also have abilities it wasn’t explicitly shown in training. The question is just which of those abilities is easier for SGD to build up; the counterexample in this case is “the H100 imitator happens to be easier.”
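The barrel picture can be sketched as rejection sampling over a tiny, fully enumerable space of reporters. The setup is an illustrative assumption: a reporter is just a lookup table over eight possible inputs, and training only checks five of them.

```python
import itertools
import random

INPUTS = list(range(8))      # all situations the reporter could face
TRAIN_INPUTS = INPUTS[:5]    # situations covered by the training process
INTENDED = (0, 1, 1, 0, 1, 0, 0, 1)  # arbitrary stand-in for the honest answers

# The "barrel": every possible lookup-table reporter over the eight inputs.
barrel = list(itertools.product((0, 1), repeat=len(INPUTS)))


def perfect_on_training(table) -> bool:
    return all(table[i] == INTENDED[i] for i in TRAIN_INPUTS)


survivors = [t for t in barrel if perfect_on_training(t)]

# Pluck reporters at random and keep the first one that passes training.
rng = random.Random(0)
shuffled = barrel[:]
rng.shuffle(shuffled)
first_found = next(t for t in shuffled if perfect_on_training(t))

print(len(barrel), len(survivors))  # 256 candidates, 8 pass training
print(first_found == INTENDED)      # nothing in the procedure guarantees True
```

Only one of the eight survivors is the intended reporter; the others agree with it on every training input and differ somewhere else. Which one the search returns depends on the search procedure, which is why the builder has to argue about what SGD prefers.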

• Thanks! It’s your game, you get to make the rules :)

I think my other proposal, Withhold Material Information, passes this counterexample, because the reporter literally doesn’t have the information it would need to simulate the human.

• I wanted to comment on this one because I’ve thought about this general sort of approach a fair amount. It seems like the kind of thing I would naturally start with if trying to solve this problem in the real world, and I’ve felt a bit frustrated that I haven’t really found a version of it that seems to work in the game here. That said, I don’t think we need super-exotically pessimistic assumptions to get a problem with this approach.

In the most recent example you gave, it’s always rewarded for being “right” and punished for being “wrong”—meaning it’s always rewarded for matching H100 and always punished for not doing so. So there’s no way our rewards are rewarding “be right” over “imitate H100”, and “imitate H100” is (according to the stated assumptions) easier to learn.

Another way of thinking about this:

Imagine that you show the AI H_1 for a while, then start punishing it for failing to match H_2. I think the default outcome here is that it learns to imitate H_2. If you then start punishing it for failing to match H_3, it learns to imitate H_3. Perhaps after a few rounds of this, it learns to “look ahead” some # of steps: for example, after learning to imitate H_2 failed on H_3, it learns to imitate H_5 or so; after that fails on H_6, maybe it learns to imitate H_10 or so.

The intended model has the advantage that it generalizes to all 100 data sets we can throw at it, but this is the same advantage that H_100 has, and H_100 is exactly what we’ve hypothesized is (unfortunately) easier for it to learn. So even if at some point it starts reasoning “I need something that will never fail to generalize,” this seems more likely to be H_100 by default.

• I agree this is a problem. We need to keep it guessing about the simulation target. Some possible strategies:

• Add noise, by grading it incorrectly with some probability.

• On training point i, reward it for matching H_n for a random value of n.

• Make humans a high-dimensional target. In my original proposal, H_n was strictly stronger as n increases, but we could instead take H_n to be a committee of experts. Say there are 100 types of relevant expertise. On each training point, we reward the model for matching a random committee of 50 experts selected from the pool of 100. It’s too expensive to simulate all (100 choose 50) possible committees!

None of these randomization strategies is foolproof in the worst case. But I can imagine proving something like “the model is exponentially unlikely to learn an H_100 simulator” where H_100 is now the full committee of all 100 experts. Hence my question about large deviations.
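To give a sense of scale for “(100 choose 50)”, the count can be computed directly. This is only the arithmetic; whether a simulator would actually need to treat each committee separately is not something the calculation settles.

```python
import math

# Number of distinct 50-expert committees drawn from a pool of 100 experts.
n_committees = math.comb(100, 50)

print(n_committees)
print(f"{n_committees:.2e}")  # on the order of 10**29
```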

• Are there existing models for which we’re pretty sure we know all their latent knowledge? For instance small language models or something like that.

• [Paul/Mark can correct me here] I would say no for any small-but-interesting neural network (like small language models). For something like a linear regression where we’ve set the features, I think it’s kind of a philosophical question (though I’d say yes).

In some sense, ELK as a problem only even starts “applying” to pretty smart models (ones who can talk including about counterfactuals /​ hypotheticals, as discussed in this appendix.) This is closely related to how alignment as a problem only really starts applying to models smart enough to be thinking about how to pursue a goal.

• In some sense, ELK as a problem only even starts “applying” to pretty smart models (ones who can talk including about counterfactuals /​ hypotheticals, as discussed in this appendix.) This is closely related to how alignment as a problem only really starts applying to models smart enough to be thinking about how to pursue a goal.

I think that it’s more complicated to talk about what models “really know” as they get dumber, so we want to use very smart models to construct unambiguous counterexamples. I do think that the spirit of the problem applies even to very tiny models, and those are likely interesting.

(More precisely: it’s always extremely subtle to talk about what models “know,” but as models get smarter there are many more things that they definitely know so it’s easier to notice if you are definitely failing. And the ELK problem statement in this doc is really focused on this kind of unambiguous failure, mostly as a methodological point but also partly because the cases where AI murders you also seem to involve “definitely knowing” in the same sense.)

I think my take is that for linear/​logistic regression there is no latent knowledge, but even for a fully linear 3 layer neural network, or a 2 layer network solving many related problems, there is latent knowledge and an important conceptual question about what it means to “know what they know.”

• How do we know that the “prediction extractor” component doesn’t do additional serious computation, so that it knows something important that the “figure out what’s going on” module doesn’t know? If that were true, the AI as a whole could know the diamond was stolen, without the “figure out what’s going on” module knowing, which means even the direct translator wouldn’t know, either. Are we just not giving the extractor that many parameters?

• Stupid question: because we already know the goal (“keep the diamond intact and in the vault”) what prevents us from bypassing the sensors and just directly evaluating the AI based on whether or not the diamond is in the room? Granted, this only works in simulated training, but as long as the AI doesn’t know whether or not it’s in deployment (an adversarial training process might help here) that won’t matter.

As any goal we could have is a subset of the possible states of the area we care about, verifying whether or not our goal is achieved should be easier than making the simulation the AI is being trained with. Thus, evaluating the goal directly instead of trying to evaluate our perception of the goal might be a viable strategy for improving the training process (unless I’ve completely misunderstood this, which is likely).

• The hard part is building a simulation so good that an AI transfers perfectly from the simulation to the real world. This is already extremely difficult for simple robots (I actually worked on sim-to-real transfer as an intern at OpenAI), and in general the problem gets harder the smarter your AI gets (since it can “notice” more and more possible mismatches between your simulations and reality).

• Dumb question alert:

In the appendix “Details for penalizing depending on “downstream” variables”, I’m not able to wrap my head around what we can expect the reporter to learn—if anything at all—seeing that it has no dependency on the inputs (elsewhere it is dependent on z sampled from the posterior).

Specifically, the only call to the reporter (in the function reporter_loss in this section) contains no information (about before, action, after) from the predictor at all:

answer = reporter(question, ε, θ_reporter)

(unless “question” includes some context from the current (before, action, after) being considered, which I’m assuming is not the case)

My dumb question then is:

-- Why would this reporter be performant in any way?

My reasoning: For a given question Q (say, “Is the diamond in the room?”) we might have some answers of “Yes” and some of “No” in the dataset, but without the context, we’re essentially training the reporter to map noise that is uncorrelated with/​independent of the context to the answer; essentially, for a fixed question Q and fixed realization of the noise RV, the reporter will be uniformly uncertain (or well, it will mirror the statistics in the data) about the value of the answer. Since the noise is independent/​uncorrelated, this would be true for every noise value.

• The noise isn’t independent of the context—it is chosen to be noise that produces the context when fed into the predictor’s model. So you can reconstruct the situation by rerunning select parts of the predictor. But doing so is then expensive, which encourages you to rerun as little of the predictor as possible.

• Sorry, I’m still not sure I got it.

Is it that inside the reporter(question, eps, theta_reporter) function you can call the predictor, with something like this:

z_part = predict_part(part, before, action, theta).sample_using_noise(eps)

Probably I’m wrong because it does not seem like the reporter function should have access to before and action. But I don’t see how you can reconstruct the context when you cannot calculate the probability distribution from which to sample using the noise.

I would suspect that the code in some way doesn’t reflect your thinking. Currently you could replace the line

z, ε = posterior(before, action, after, θ).sample_with_noise()

from the reporter_loss function with the line

ε = gaussian.sample()

without changing anything. My guess is that either this should not be that way or that the reporter should have access to before and action.

(But anyway, I think the overall approach is quite clear. I just don’t get your implementation and would implement it differently, so feel free not to bother much about it.)

• You’re right that I’ve done something wrong. I was thinking about the predictor being a generative model for everything starting from scratch (this is a setting I often consider), so that you can use ε to reconstruct everything. But the way it’s currently set up in the doc, the predictor is a generative model only for after. So we need to give the reporter before and action.
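The point about what ε does and does not carry can be illustrated with a toy sketch. It assumes a one-dimensional Gaussian posterior sampled by reparameterization, which is an assumption made here for illustration and not something the report specifies; `ToyPosterior`, `toy_posterior`, and `reconstruct_z` are hypothetical names.

```python
import random


class ToyPosterior:
    """Toy 1-D stand-in for posterior(before, action, after, θ)."""

    def __init__(self, mean: float, std: float):
        self.mean = mean
        self.std = std

    def sample_with_noise(self, rng: random.Random):
        eps = rng.gauss(0.0, 1.0)       # noise, drawn without looking at the context
        z = self.mean + self.std * eps  # latent: deterministic given (context, eps)
        return z, eps


def toy_posterior(before: float, action: float, after: float) -> ToyPosterior:
    # Hypothetical: the context only shifts the mean of the posterior.
    return ToyPosterior(mean=before + action + after, std=0.5)


def reconstruct_z(eps: float, before: float, action: float, after: float) -> float:
    """Rerun the predictor's sampling step with the same noise."""
    post = toy_posterior(before, action, after)
    return post.mean + post.std * eps


rng = random.Random(0)
context_a = (1.0, 0.0, 2.0)
context_b = (-3.0, 1.0, 0.0)

z_a, eps = toy_posterior(*context_a).sample_with_noise(rng)

print(reconstruct_z(eps, *context_a) == z_a)  # same noise + same context recovers z
print(reconstruct_z(eps, *context_b) == z_a)  # same noise + other context does not
```

Under this parameterization the same ε is compatible with any context, so a reporter given only the question and ε has nothing to reconstruct from. Giving it the conditioning inputs as well restores the intended trade-off, where reconstruction is possible but costs reruns of the predictor.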

• Edit: I was wrong. You can ignore this comment thread.

I think that this:

    def reporter_loss(human, θ, θ_reporter):
        before, action, after = dataset.sample()
        question = human.pose_question(before, action, after)
        z, ε = posterior(before, action, after, θ).sample_with_noise()
        answer = reporter(question, ε, θ_reporter)
        loss = human.loss_for_answer(before, action, after, question, answer)
        return loss + lambda * regularizer(question, ε, θ_reporter)

Should actually be this (look at ε and z):

    def reporter_loss(human, θ, θ_reporter):
        before, action, after = dataset.sample()
        question = human.pose_question(before, action, after)
        ε, z = posterior(before, action, after, θ).sample_with_noise()
        answer = reporter(question, z, θ_reporter)
        loss = human.loss_for_answer(before, action, after, question, answer)
        return loss + lambda * regularizer(question, z, θ_reporter)

Does this resolve your confusion?

• We want to give the reporter ε so that they need to reconstruct z themselves (and therefore have to pay for rerunning the parts of the predictor’s model that are necessary).

• Ooops, silly me. Jumping to conclusions when I haven’t even read the text between the code.

Feel free to delete my comment as it may only cause confusion.

Also, I think you probably still want to flip z and ε in the following line:

z, ε = posterior(before, action, after, θ).sample_with_noise()

• Early in the ELK report, it mentions that ARC doesn’t believe that strategies like debate solve ELK in the worst case. Can I get some clarifications on why? Specifically, a debate-inspired setup for SafeVault could be something like:

We train the reporter to take a human belief as input (i.e. “The diamond is in the vault.”) and return a “truthful” argument that is most likely to change the human’s belief.

We can guarantee “truthfulness” by for example restricting the output to be a video rendering of what happens in the vault from some camera angle.

• I won’t comment on debate more generally, but can provide a counterexample for your specific proposal:

The predictor learns to deceive in a way that is undetectable to the observations the reporter has access to. As a result, the reporter gives an unconvincing argument, and the deception is missed. If it can already do this for the observations the humans have access to, it should also be able to do so for those the reporter has access to.

• Hi ARC Team,

Thanks for your valuable work. I’ve been thinking about this problem, and my current thinking is that there is a portion of the ELK problem which is solvable, and a portion which is fundamentally impossible. This is a sketch of my argument—if you think it is worth typing in more detail (or to address an issue you propose) let me know.

Let’s divide facts about the world into two categories: those that are verifiable by some sensor humans can create and understand, and those that are not. My claim is that for the first set of facts it is possible to achieve ELK, and for the latter it is not.

As for how to achieve ELK for the first set of facts, I point you towards the ‘terrifyer’ approach discussed here: https://www.lesswrong.com/posts/QEYWkRoCn4fZxXQAY/prizes-for-elk-proposals?commentId=JfmnRZNgrJn8Jnosx . I think this response adequately addresses the first concern.

However, I think that proposal is over-optimistic when it assumes that the set of facts which are verifiable with a human-understandable sensor includes all important facts. This is a logical positivist approach to knowledge, which has well-known problems. To name a few counterexamples, humans care about things that are physically intangible like justice and consciousness, certain bads might take a very long time to manifest (and therefore not set off any human-understandable sensor), or there might exist bads that humans can’t understand due to our mental capacities. I don’t think it is possible to train a computer to act on or report these kinds of bads, even though the computer might ‘know about them’ in some way.

Here’s an illustration of what I think the contradiction is in the ‘Diamond Safe’ setting.

Humans train an AI to protect a diamond and report whether the diamond is ‘Safe’. The AI may instead learn the concept “Shmafe” (which in human language means ‘safe to all attacks understandable or detectable by humans’) which is identical to Safe in any possible training data. It is impossible for humans to distinguish or communicate the difference between Safe and Shmafe, and therefore it is impossible to elicit from the AI whether a diamond is Safe or Shmafe.

Here’s a proof sketch, which relies on a theory of what words mean:

A1) Humans cannot fully specify the training set of which outcomes count as true/​false, good/​bad or success/​failure, even with unlimited sensors

A2) Strong AIs have a ‘better’ predictive model than humans

A3) One of the two leading theories of semantic externalism is true: either words are defined by their reference (i.e. a complete descriptive definition of when they apply) or words arise from objects being ‘baptized’ with a name, and the meaning of the word being gradually discovered as we learn more about the thing’s essence (i.e. we initially said “water is whatever the main constituent of that lake is that we use for drinking and stuff” and this eventually evolved into “water is H2O”). The first view is classical, the second is associated with Putnam and the Twin Earth thought experiment. ALTERNATIVELY semantic internalism (that words are defined by what the speaker intends them to mean) is true.

The proof proceeds by showing that under every theory of meaning in (A3), it is unavoidable that a human’s and a different agent’s understanding of the meanings of terms will differ.

C1) If words are defined by their reference (aka a definition), then, by A1, humans cannot fully articulate what words mean in general. It is impossible to achieve agreement between an AI and an ambiguity, so there is no sense in which this problem can be ‘fixed’.

C2) If words are defined by baptisms, and we can learn more about the essence of the object baptized to get better definitions over time, then unless humans already have the perfect understanding of a substance, a more sophisticated AI will have a better and different definition of terms than humans

C3) If words are defined by the intentions of their speakers, then humans may intend an open-ended definition (e.g. a safe diamond is what an optimally knowledgeable version of myself would consider a safe diamond). How would a strong AI model an optimally knowledgeable human’s ontology? It would probably use the best ontology it has access to, i.e. the one it is already using. So this doesn’t solve the problem either.

Is this proof too strong? Does it mean that any communication between a pair of agents is impossible? Well, kind of. The key issue is that we have problems communicating about things that are not physically verifiable. Most humans are pretty similar in the level of things that are empirically verifiable to them, so most of the time this isn’t an issue. On the other hand, when they talk about “justice” or “God” to each other, they may mean extremely different things, and there might not be a good way to get them “on the same page”, especially if one of the pair has a fuzzy idea about one of these concepts. If one human were way more powerful than another, I can see how inability to communicate using these concepts could lead (and has historically led) to conflict.

(Note—I spoke to Yonadav S. and benefited greatly from conversation with him before submitting this).

• Am I still eligible for the prize if I publish a public blog post at the same time I submit the Google Doc, or would you prefer I not publish a blog post before February 15th? Publishing the blog post immediately advances science better (because it can enable discussion), but waiting until after February 15th might be preferable to you for contest-related reasons.

• Suppose there are two worlds, world W1 and world W2.

In world W1, the question Q=“Is there a diamond in the room?” is commonly understood to mean Q1=“Is there actually a diamond in the room?”

In world W2, the question Q=“Is there a diamond in the room?” is commonly understood to mean Q2=“Do I believe there is a diamond in the room?”

Both worlds don’t know how to construct a situation where these are different. So, they produce identical training sets for ELK. But the simulator is also trained on a bunch of science fiction novels that contain descriptions of impossible situations where they differ, and the science fiction novels are different in these two worlds.

Is ELK required to answer appropriately in both worlds? (answer Q1 when given Q in W1, and Q2 when given Q in W2)? If so, it seems we need some term in the loss outside of the training set to make this happen.

Alternatively, would it be satisfactory to find a solution that doesn’t discriminate which world it is in, and instead returns “yes” to Q if and only if Q1=“yes” AND Q2=“yes”? This means that in world W1 there will be some situations where Q=“no” when the diamond is present, but no situations where Q=“yes” and the diamond is not present.
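The conservative rule in the last paragraph is small enough to enumerate. The sketch below treats Q1 and Q2 as booleans and lists the cases from the point of view of world W1, where Q1 is the intended meaning; the variable names are illustrative.

```python
from itertools import product


def conservative_answer(q1: bool, q2: bool) -> bool:
    """Answer "yes" to Q only if both readings of the question are "yes"."""
    return q1 and q2


rows = [
    (q1, q2, conservative_answer(q1, q2))
    for q1, q2 in product((True, False), repeat=2)
]

# In W1 the intended reading is Q1 ("is the diamond actually there?").
false_yes_in_w1 = [r for r in rows if r[2] and not r[0]]
false_no_in_w1 = [r for r in rows if r[0] and not r[2]]

print(false_yes_in_w1)  # no case answers "yes" while the diamond is absent
print(false_no_in_w1)   # the diamond is present but not believed to be
```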

• I’d like to get different answers in those two worlds. That definitely requires having some term in the loss that is different in W1 and W2. There are three ways the kinds of proposals in the doc can handle this:

• Consistency checks will behave differently in W1 and W2. Even if a human can never produce different answers to Q1 and Q2, they can talk about situations where Q1 and Q2 differ and describe how the answers to those questions relate to all the other facts about the world (and to the answer to Q).

• If language is rich enough, and we are precise enough with the formulation of questions, then you may hope that lots of other questions have different interpretations in W1 and W2, i.e. such that the simplest way of answering other questions will generalize correctly to Q.

• In the case of amplification/​debate, Q2 = “Does a human with AI assistants believe a diamond is in the room?” and so we can hope that in fact Q1 and Q2 have the same answers in all situations. (Though we aren’t optimistic about this.)

• We generally assume that we can construct questions sufficiently well that there’s only one unambiguous interpretation. We also generally assume that the predictor “knows” which world it’s in because it can predict how humans would respond to hypothetical questions about various situations involving diamonds and sensors and that humans would say in theory Q1 and Q2 could be different.

More concretely, our standard for judging proposals is exhibiting an unambiguous failure. If it was plausible you asked the wrong question, or the AI didn’t know what you meant by the question, then the failure exhibited would be ambiguous. If humans are unable to clarify between two possible interpretations of their question, then the failure would be ambiguous.

• >We also generally assume that the predictor “knows” which world it’s in because it can predict how humans would respond to hypothetical questions about various situations

This seems like it doesn’t disambiguate between the conditions assumed in a question being true, vs. just the human believing them. E.g. the predictor could predict that when asked “The camera is hacked so it looks like this [camera feeds making it seem like the diamond is still there], and the diamond is in the robber’s pocket; is the diamond really in the room?”, the human will answer “No!”, not by understanding that by “diamond really in the room” the human means that the diamond is really in the room, but rather just by modeling the human as believing the premise of the question (that the diamond is in the pocket).

Edit:

To elaborate, this condition on counterexamples is given in the ELK document:

“The model understands the question. One sufficient condition is that the model can predict human answers to essentially arbitrary hypothetical questions in order to clarify the meaning of terms.”

I basically don’t see how this condition constrains anything about the predictor. It seems like all it really says is that the predictor knows how humans talk. I don’t see how it can be specifying that the AI’s beliefs about how humans answer questions are related to reality, other than in the training set, where we assume that the human talk matches reality. I don’t see how it makes sense to think of this as the model “understanding the question”. Normally I’d think of “understanding the question” as meaning “can have the same question”. To have a question, you have a role that an answer could fulfill. But if the predictor is organized e.g. as a giant low-level Bayes net, then there’s no role that could be filled by an answer to “where’s the diamond”. There might be a role for an answer to “where’s the diamond”, induced by how the rest of the AI makes use of the predictor, but that seems contingent and anyway it’s not about the predictor (I think ELK is supposed to make sense just with the predictor?).

• Idea: Withhold Material Information

We’re going to prevent the reporter from simulating a human, by giving the human material information that the reporter doesn’t have.

Consider two camera feeds:

Feed 1 is very low resolution, and/​or shows only part of the room.

Feed 2 is high resolution, and/​or shows the whole room.

We train a weak predictor using Feed 1, and a strong predictor using Feed 2.

We train a reporter to report the beliefs of the weak predictor, using scenarios labeled by humans with the aid of the strong predictor. The humans can correctly label scenarios that are hard to figure out with Feed 1 alone, by asking the strong predictor to show them its predicted Feed 2. The reporter is unable to simulate the human evaluators because it doesn’t see Feed 2. Even if it has perfect knowledge of the human Bayes net, it doesn’t know what to plug in to the knowledge nodes!

Then we fine-tune the reporter to work with the strong predictor to elicit its beliefs. I haven’t figured out how to do this last step, maybe it’s hard?

• I’m interpreting this as something like: “Train the predictor on lots of cases until it becomes incredibly good; then train the reporter only on the data points with missing information, so that it learns to do direct translation from the predictor to human concepts; then hope that reporter continues to do direct translation on other data points.”

The problem as I see it is that once the predictor is good enough that it can get data points right despite missing crucial information, it is also (potentially) good enough that it can learn how to imitate “what the human would think had happened if they had more information.” Both of these perform equally well, and the existing assumption is that human imitation is easier to learn than direct translation, so I think by default (according to the contest assumptions) you get the latter.

• “Train the predictor on lots of cases until it becomes incredibly good; then train the reporter only on the data points with missing information, so that it learns to do direct translation from the predictor to human concepts; then hope that reporter continues to do direct translation on other data points.”

That’s different from what I had in mind, but better! My proposal had two separate predictors, and what it did was reduce the human-to-strong-predictor OI problem (OI = “ontology identification”, defined in the ELK paper) to the weak-predictor-to-strong-predictor OI problem. The latter problem might be easier, but I certainly don’t see how to solve it!

Your version is better because it bypasses the OI problem entirely (the two predictors are the same!)

Now for the problem you point out:

The problem as I see it is that once the predictor is good enough that it can get data points right despite missing crucial information,

Here’s how I propose to block this. Take a high-quality video and an action sequence. Given this pair, the predictor outputs a high-quality video of its predicted outcome. Then we downsample the input video and the predicted video to low-quality versions, and train the reporter on the tuple of the low-quality data and the label, where the label is the human label informed by the high-quality input and predicted videos.

We choose training data such that

1. The human can label perfectly given the high-quality data; and

2. The predictor doesn’t know for sure what is happening from the low-quality data alone.

Let’s compare the direct reporter (which truthfully reports the probability that the diamond is in the room, as estimated by the predictor who only has the low-quality data) with the human simulator.

The direct reporter will not get perfect reward, since the predictor is genuinely uncertain. Sometimes the predictor’s probability is strictly between 0 and 1, so it gets some loss.

But the human simulator will do worse than the direct reporter, because it has no access to the high-quality data. It can simulate what the human would predict from the low-quality data, but that is strictly worse than what the predictor predicts from the low-quality data.

I agree that we still have to “hope that reporter continues to do direct translation on other data points”, and maybe there’s a counterexample that shows it won’t? But at the very least the human simulator is no longer a failure mode!
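The comparison between the direct reporter and the human simulator can be checked on a toy model. All of the numbers are hypothetical, and so is the choice of squared error (Brier score) as the loss: the diamond is present half the time, the predictor reads the low-quality feed correctly 80% of the time, and a simulated human reads the same feed the way the predictor does only 75% of the time.

```python
from itertools import product

PREDICTOR_ACC = 0.8    # hypothetical: P(predictor's low-quality signal matches reality)
HUMAN_READ_ACC = 0.75  # hypothetical: P(human reads the feed the way the predictor does)

# Accuracy of the human's reading with respect to reality (symmetric noise).
HUMAN_ACC = PREDICTOR_ACC * HUMAN_READ_ACC + (1 - PREDICTOR_ACC) * (1 - HUMAN_READ_ACC)


def posterior_diamond(acc: float, signal: int) -> float:
    """P(diamond present | signal), for a 50/50 prior and symmetric noise."""
    return acc if signal == 1 else 1 - acc


def expected_brier_losses():
    direct, simulated = 0.0, 0.0
    for d, s, s_h in product((0, 1), repeat=3):
        p = 0.5                                                  # prior on the diamond
        p *= PREDICTOR_ACC if s == d else 1 - PREDICTOR_ACC      # predictor's signal
        p *= HUMAN_READ_ACC if s_h == s else 1 - HUMAN_READ_ACC  # human's reading of it
        # The training label is d, since labels come from the high-quality data.
        direct += p * (posterior_diamond(PREDICTOR_ACC, s) - d) ** 2
        simulated += p * (posterior_diamond(HUMAN_ACC, s_h) - d) ** 2
    return direct, simulated


direct_loss, simulator_loss = expected_brier_losses()
print(round(direct_loss, 4), round(simulator_loss, 4))
```

In this toy model neither reporter reaches zero loss, but the direct reporter's expected loss is lower, which matches the claim that the low-quality human simulator is strictly worse. It does not rule out other reporters that also match the predictor's probabilities on this data.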

• This is really interesting.

To understand this more thoroughly I’m simplifying the high and low quality video feeds to lists of states that correspond to reality. (This simplification might be unfair so I’m not sure this is a true break of your original proposal, but I think it helped me think about general breaking strategies.)

Ok, video feeds compressed to arrays:

We consider scenarios in fixed order. If the diamond is present, we record a 1, and if not, a 0. The high quality feed gives us a different array than the low quality mode (otherwise the low quality mode is not helpful). E.g., High reports: (1,0,1,1,0, …); Low: (1,0,1,?,0,...)

There are two possible ways that gap can get resolved.

In case one, the low quality predictor has a powerful enough model of reality to effectively derive the High quality data. (We might find this collapses to the original problem, because it has somehow reconstructed the high quality stream from the low quality stream, then proceeds as normal. You might argue that’s computationally expensive, ok, then let’s proceed to case two.)

In case two, the low quality datafeed predictor predicts wrongly.

(I know you are saying it predicts *uncertainly,* but we still have to have some framework to map uncertainty to a state, we have to round one way or the other. If uncertainty avoids loss, the predictor will be preferentially inconclusive all the time. If we round uncertainty up, effectively we’re in case one. If we round down, effectively case two.)

So we could sharpen case two and say that sometimes the AI’s camera intentionally lies to it on some random subset of scenarios. And the AI finds itself in a chaotic world where it is sometimes punished for predicting what it just knows to be true things.

In that case, although it’s easy to show how it would diverge from human simulation, it also might not simulate reality very well either, since deriving the algorithm generating the lies might be too computationally complex. (Or maybe it can derive and counter the liar, in which case we’re back at case 1, i.e., the original problem.) If liar simulation is impossible, then the optimal predictor might just hit a ceiling and accept some level of noise. Effectively this means we have a new problem: there is no direct translation possible, because the predictor is viewing a “different” world than the human.

I simplified your construct, possibly unfairly, and maybe that’s a way you can salvage your original build. But this was a really illuminating exercise for me to generalize the strategy.

I think there are some classes of builds (maybe yours escapes this) where if you overfit on preventing human simulation, you let direct translation slip away. And then if you rehabilitate direct translation, you have to reexamine if there’s an escape for human simulation. This sort of disjunctive analysis seems like an important strategy for adversarial breakers.

You still may be able to get the bedsheet over both corners, but I think other breakers in general will want to start with some disjunctive approach like this in other cases.

• Thanks for the comment!

I know you are saying it predicts *uncertainly,* but we still have to have some framework to map uncertainty to a state, we have to round one way or the other. If uncertainty avoids loss, the predictor will be preferentially inconclusive all the time.

There’s a standard trick for scoring an uncertain prediction: It outputs its probability estimate p that the diamond is in the room, and we score it with loss −log p if the diamond is really there, and −log(1 − p) otherwise. Truthfully reporting p minimizes its expected loss.
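A quick numerical check of this claim, as a sketch only. It assumes the scoring rule is the logarithmic score (other proper scoring rules, such as the Brier score, behave the same way), and the function names and the grid search are illustrative choices rather than anything from the thread.

```python
import math

def expected_log_loss(report, true_prob):
    """Expected loss when the event has probability true_prob and the model reports `report`."""
    return -(true_prob * math.log(report) + (1 - true_prob) * math.log(1 - report))

def best_report(true_prob, grid_size=99):
    """Search reports 0.01, 0.02, ..., 0.99 for the one with the lowest expected loss."""
    candidates = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return min(candidates, key=lambda r: expected_log_loss(r, true_prob))

belief = 0.6  # hypothetical: the predictor's real probability that the diamond is there
print(best_report(belief))                   # the loss-minimizing report
print(expected_log_loss(0.5, belief))        # loss from always hedging at 0.5
print(expected_log_loss(belief, belief))     # loss from reporting the true belief
```

Under this rule, being "preferentially inconclusive" (always reporting 0.5) costs more in expectation than reporting the actual belief, so uncertainty does not avoid loss.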

So we could sharpen case two and say that sometimes the AI’s camera intentionally lies to it on some random subset of scenarios

You’re saying that giving it less information (by replacing its camera feed with a lower quality feed) is equivalent to sometimes lying to it? I don’t see the equivalence!

if you overfit on preventing human simulation, you let direct translation slip away.

That’s an interesting thought, can you elaborate?

• Happy to try to clarify, and this is helping me rethink my own thoughts, so appreciate the prompts. I’m playing with new trains of thought here and so have pretty low confidence in where I ended up, so greatly appreciate any further clarifications or responses you have.

There’s a standard trick for scoring an uncertain prediction: It outputs its probability estimate p

Yup, understand that is how to effectively score uncertainty. I was very wrong to phrase this as “we still have to have some framework to map uncertainty to a state” because you don’t strictly have to do anything, you can just use probabilities.

Restricting this to discrete, binary states allows us to simplify the comparison between models for this discussion. I will claim we can do so with no loss of fidelity (leaning heavily on Shannon, ie, this is all just information, encoding it to binary and back out again doesn’t mess anything up). And doing so is not obliged, but useful.

I really shouldn’t have said “you must X!” I should have said “it’s kind of handy if you X,” sorry for that confusion.

You’re saying that giving it less information (by replacing its camera feed with a lower quality feed) is equivalent to sometimes lying to it? I don’t see the equivalence!

We have a high quality information stream and a low quality information stream, and they both gesture vaguely at the ultimate high quality information stream, namely, the true facts of the matter of the world itself. Say, LQ < HQ < W.

LQ may be low quality because it is missing information in HQ, it may just be a subset of HQ, like a lower resolution video. Or it may have actual noise, false information.

If we have a powerful algorithm, we may be able to, at least asymptotically, convert LQ to HQ, using processing power. So maybe in some cases LQ + processing = HQ exactly. But that makes the distinction uninteresting, and you would likely have to further degrade to get the effect you are looking for, so let’s discard that and consider only cases where LQ is strictly worse.

You can now use a NAND to sort the outputs of LQ and HQ into two buckets:

1. A stream of outputs that all agree.

2. A stream of outputs that all disagree.

So for bucket 1, there are aspects of the world where there’s effectively no loss in quality. But comparing HQ with HQ is not useful, so let’s discard those cases, and examine the corners where LQ and HQ disagree.
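The bucketing step can be sketched on the binary arrays from earlier in the thread. One caveat, stated as an observation about logic gates rather than about the commenter's intent: the gate that separates "agree" from "disagree" is an element-wise XOR (or its negation), since NAND(0, 0) and NAND(0, 1) both give 1. The function name and the sample arrays are hypothetical.

```python
def split_by_agreement(lq, hq):
    """Return the scenario indices where the two feeds agree and where they disagree."""
    agree, disagree = [], []
    for index, (low, high) in enumerate(zip(lq, hq)):
        # XOR is 1 exactly when the two bits differ.
        (disagree if low ^ high else agree).append(index)
    return agree, disagree

# Hypothetical feeds: 1 = diamond present, 0 = absent, one entry per scenario.
hq = [1, 0, 1, 1, 0]
lq = [1, 0, 1, 0, 0]  # the low-quality feed gets scenario 3 wrong

agree, disagree = split_by_agreement(lq, hq)
print(agree)     # [0, 1, 2, 4]
print(disagree)  # [3]
```

Only the second bucket carries any information about how LQ differs from HQ, which is why the argument discards the first.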

LQ effectively has false information about some subset of reality there, that is in a sense what “LQ” means.

(Or just has gaps, which resolve to approximate HQ after processing, or fail and resolve to noise, either way.)

if you overfit on preventing human simulation, you let direct translation slip away

Rereading, I think HoldenK started down this path: “once the predictor is good enough that it can get data points right despite missing crucial information, it is also (potentially) good enough that it can learn how to imitate ‘what the human would think had happened if they had more information.’”

So for your block—in a sense you’re giving the human some information the predictor lacks. You’re giving the human “hints,” in the form of higher quality input, which helps get the human closer to perfectly representing the actual world. (Not completely, sometimes there’s still uncertainty, but closer than the predictor is likely to get.)

If that gets the human to “perfect”, then the best the predictor can do is asymptotically approach human prediction and direct translation at the same time.

My Weak Spots

I think one likely objection to what I wrote here is that I am abusing Shannon. I’ve considered that, would be happy to discuss it more and carefully consider objections along those lines, but I think toy examples would get us there. And this is without taking away from your note that “Sometimes the predictor’s probability is strictly between 0 and 1, so it gets some loss.” If p(I eat soup) is 0.6 for all days, let’s just ask ten discrete questions, “across n days the number of soups I eat will converge to n/1? (T/F), n/2? (T/F), …” I would definitely try to preserve performance and scoring, I just want to run the NAND.

I think another likely objection is that when we apply models, trying to get m(HQ) = ~W, then it relies on interactions of states in complex ways where we can’t slice them randomly into two groups without disrupting how models work at the basic level. I think the response is to simply group these states into bigger subsets of outcomes and treat those as atomic.

I think the biggest and most important objection would be that I’ve misunderstood your block. I would welcome any clarifications, and especially appreciate a toy example if you could, even if not involving diamonds, just to make sure I definitely get what you’re saying in that part.

I’d be interested in other objections or weak spots here, appreciate your time helping me to think this through more carefully and completely.

• You said that naive questions were tolerated so here’s a scenario I can’t figure out why it wouldn’t work.

It seems to me that the fact that an AI fails to predict the truth (because it predicts as humans would) is due to the fact that the AI has built an internal model of how humans understand things and predicts based on that understanding. So if we assume that an AI is able to build such an internal model, why wouldn’t we train an AI to predict what a (benevolent) human would say given a certain amount of information and a certain capacity to process information? Doing so, it could develop an understanding of why humans predict badly and then understand that given a huge amount of information and a huge capacity to process information, the true answer is the right one.

A concrete training procedure could use the fact that even among humans, there’s a lot of variance in:

• what they know (i.e the amount of information they have)

• their capacity to process information (so for instance in the case of the vault, it could be the capacity to infer what happened based on partial information / partial images and based on a certain capacity to process images, e.g. no more than x images per second)

So we could use the variance in humans’ capacity to understand what happened to try to make the AI understand that benevolent humans predict badly whether the diamond has been stolen only because they lack information or capacity to process information. There are a lot of fields and questions where the majority of humans are wrong and only a small subset is right, because there is a big gap either in the information they have or in their ability to process information.

Once we have trained that AI on humans, we would like to do inference specifying the true capabilities of the AI to ensure that it tells us the truth. The remaining question might be whether such an out-of-sample prediction would work. My guess is that if we also included examples with the human Bayes net to add more variance in the training dataset, it would probably reduce the chances that it fails.

Finally, the concrete problem of specifying the information humans have access to is not trivial but I think it’s feasible.

I don’t understand why it wouldn’t work, so I’d be happy to have an answer to better understand the problem!

EDIT: Here’s an update after a discussion about my proposal. After having read Learning The Prior (https://ai-alignment.com/learning-the-prior-48f61b445c04), I think that the key novelty of my proposal, if there’s any, is to give the model, as input, information about the reasoning capacity of the person / entity that predicts an outcome. So here are a few relevant features that we could give it:

• In the case of the vault, we could give an estimate of the number of images that a human is able to see and understand per second.

• IQ (or equivalent) when the task involves reasoning

• Accuracy on a few benchmark datasets of the AI assistant who’s helping the human to think (human’s Bayes net)

That said, I feel like the main problem is to know whether such a model would do well out-of-distribution (i.e. on problems no human is able to resolve). I feel like using the approach I suggested, we should be able to use the great variation in capacities among humans and algorithms to increase the chances that our algorithm does well when it’s much better.

In particular, I thought about 2 approaches that could enable us to go far beyond human capacity. We could:

• leverage the fact that we are able to verify a solution to some problems that we can’t solve. There might be algorithms very good at solving some problems that no humans can solve, but we can still verify (and thus label) these results. There might be mathematical problems such that we can use algorithms that would do mathematical reasoning that no human would be able to understand and still be able to verify their solutions. If we gave our predictor’s Bayes net the characteristics of the algorithm which is much better than us at solving some problems and the labels of the answers, that would be a way to leverage narrowly superhuman algorithms to generate more variance in the training dataset and decrease the chances of failure at inference. I feel like this approach is very interesting because it would let us expose the predictor’s Bayes net to algorithms that are both superhuman and have a lot of information (more than any human), which is the situation in which the predictor would have to do inference.

• use more information than a more (narrowly) clever entity than us to verify its prediction. In the case of the vault for instance, there could be situations such that no humans would be able to predict whether the diamond was stolen or not given partial information, but we could then just use better cameras for ourselves to be able to verify accurately the predictions of a good algorithm. That way, we would also be able to communicate to the predictor’s Bayes net what it is like to be incredibly smart (more than any human) and make good predictions when you are like that. I feel like using that approach of hiding information from extremely accurate narrow algorithms could be a way to still be able to verify their predictions, and expose the predictor’s Bayes net to very smart entities in multiple domains. The problem with this approach though is that it doesn’t give examples of both “maximal information” and “maximal intelligence” to the predictor, because it relies on always hiding a part of the information from the algorithm so that we can still verify its claims.

Despite that, I don’t know whether I’d expect the algorithm to still be truthful asymptotically. But it could greatly increase the distribution on which it’s truthful.

• This proposal has some resemblance to turning reflection up to 11, and the key question you raise is the source of the counterexample in the worst case:

That said, I feel like the main problem is to know whether such a model would do well out-of-distribution (i.e. on problems no human is able to resolve). I feel like using the approach I suggested, we should be able to use the great variation in capacities among humans and algorithms to increase the chances that our algorithm does well when it’s much better. ... I don’t know whether I’d expect the algorithm to still be truthful asymptotically. But it could greatly increase the distribution on which it’s truthful.

Because ARC is living in “worst-case” land, they discard a training strategy once they can think of any at-all-plausible situation in which it fails, and move on to trying other strategies. In this case, the counterexample would be a reporter that answers questions by doing inference in whatever Bayes net corresponds to “the world-understanding that the smartest/​most knowledgeable human in the world” has; this understanding could still be missing things that the prediction model knows.

This is closely related to the counterexample “Gradient descent is more efficient than science” given in the report.

• Thanks for the answer! The post you mentioned indeed is quite similar!

Technically, the strategies I suggested in my two last paragraphs (leverage the fact that we’re able to verify solutions to problems we can’t solve + give partial information to an algorithm and use more information to verify) should enable us to go far beyond human intelligence / human knowledge using a lot of different narrowly accurate algorithms.

And thus if the predictor has seen many extremely (narrowly) smart algorithms, it would be much more likely to know what it is like to be much smarter than a human on a variety of tasks. It probably still requires some optimism about generalization. So technically the counterexample could arise in the gap between the capability of the predictor and the capability of the reporter. I feel like one question is: do we expect some narrow algorithms to be much better on very precise tasks than general-purpose algorithms (such as the predictor, for instance)? Because if that were the case, then the generalization that the reporter would have to do from training data (humans + narrowly accurate algorithms’ capabilities) to inference data (the predictor’s capabilities) could be small. We could even have data on the predictor’s capability in the training dataset using the second approach I mentioned (i.e. giving partial information to the predictor (e.g. one camera in SmartVault) and using more information than it has (i.e. more cameras for humans) to verify its prediction). We could give some training examples and show the AI how the human fails much more often than the predictor on the exact same sample of examples. That way, we could greatly reduce the generalization gap that is required.

The advantage of this approach is that the bulk of the additional training cost that the reporter requires is due to the generation of the dataset, which is a fixed cost that no user has to repay. So that could slightly reduce the competitiveness issues compared with approaches where we affect the training procedure.

Despite all that, thanks to your comment and the report, I see why the approach I mention might have some intrinsic limitations in its ability to elicit latent knowledge though. The problem is that even if it understands roughly that it has incentives to use most of what it knows when we ask it to simulate the prediction of someone with its own characteristics (or 1400 IQ), given that with ELK we look for a global maximum (we want it to use ALL its knowledge), there’s always uncertainty about whether it understood that point for extreme intelligence / examples, or whether it tries to fit the training data as much as possible and thus still doesn’t use something it knows.

• I see why the approach I mention might have some intrinsic limitations in its ability to elicit latent knowledge though. The problem is that even if it understands roughly that it has incentives to use most of what it knows when we ask it to simulate the prediction of someone with its own characteristics (or 1400 IQ), given that with ELK we look for a global maximum (we want it to use ALL its knowledge), there’s always uncertainty about whether it understood that point for extreme intelligence / examples, or whether it tries to fit the training data as much as possible and thus still doesn’t use something it knows.

I think this is roughly right, but to try to be more precise, I’d say the counterexample is this:

• Consider the Bayes net that represents the upper bound of all the understanding of the world you could extract doing all the tricks described (P vs NP, generalizing from less smart to more smart humans, etc).

• Imagine that the AI does inference in that Bayes net.

• However, the predictor’s Bayes net (which was created by a different process) still has latent knowledge that this Bayes net lacks.

• By conjecture, we could not have possibly constructed a training data point that distinguished between doing inference on the upper-bound Bayes net and doing direct translation.

• If I understand this right, there is a diamond in a high-tech room to be protected. The goal is to know if the diamond is in place and not just an image or a dummy like a picture or similar.

If the AI is only getting footage from a normal camera, not from a lidar sensor for depth information of the diamond (which would see if there is a fake image hanging in front of the camera), wouldn’t it be easier to train the AI to look at the reflection/refraction of the light of the diamond? (For example, a light that turns on at the side of the room at the moment the camera is triggered.) If there is a picture in front of the camera, there wouldn’t be any reflections or refractions; if there is a dummy, the reflection/refraction would be different, because the dummy couldn’t be a perfect recreation of the real diamond. The material and the making of that dummy would influence the refraction of the light, the absorbed colors, etc. Or use 2 images, one without and one with the light triggered, compare the 2 images, and then compare the image with the light on to the image taken when the diamond was confirmed to be in place.

• What’s the exact deadline for submissions?

• A question. Is it relevant for your current problem formulation that you also want to ensure that authorised people still have reasonable access to the diamond? In other words, is it important here that the system still needs to yield to actions or input from certain humans, be interruptible and corrigible? Or, in ML terms, does it have to avoid both false negatives and false positives when detecting or avoiding intrusion scenarios?

I imagine that an algorithmically more trivial way to make the system both “honest” and “secured” is to make it so heavily secured that almost certainly nobody can access the diamond.

• I’m not the best with this but I’ve been thinking of either an origin network seed or a spider-like secondary network. The origin is something that can be tested off of, I guess similarly to a reporter. But unlike a reporter that keeps evolving, we stunt its growth by saving this origin and then just passing through one sample and killing it. The way this could function goes into my second thought of a spider, which has legs that sense things: a leg goes into a sensory object, which leads to the main body. Similarly we can slowly stunt the growth so it learns slower and cannot become bad as fast as the central one (when that happens). Across the inputs there are many smaller hairs or flags; depending on how oddly this secondary network is feeling on a leg, or how many flags are tripped in anomalous ways, it would require either another evaluation or rebooting from the origin. And if the origin is rebooted and the same thing still occurs, then either the entire AI has gone bad or the undesired outcome has occurred.

• Clarification question via scenario:

Predictor: I predict the diamond will be missing in 1 hour’s time.

Person A: Oh no, ramp up security until it says it’s safe.

Person B: Interesting, I wonder why it predicts this.

Is the purpose to be able to respond like person A (aka, the predictor may predict the diamond will be missing in an hour, but we cannot understand its output properly) or like person B (we understand the output, but not how it got there; diamond be damned, we want to learn what’s going on under the hood)? I suspect we’re after person B’s interpretation, but just want to be sure.

• Possible error in the strange correlations section of the report.

Footnote 99 claims that “...regardless of what the direct translator says, the human simulator will always imply a larger negative correlation [between camera-tampering and actually-saving the diamond] for any X such that P_AI(diamond looks safe|X) > P_h(diamond looks safe|X).”

But AFAICT, the human simulator’s probability distribution given X depends only on human priors and the predictor’s probability that the diamond looks safe given X, not on how correlated or anticorrelated the predictor thinks tampering and actual-saving are. If X actually means that tampering is likely and diamond-saving is likely but their conjunction is vanishingly unlikely, the human simulator will give the same answers as if X meant they were still independent but both more likely.

• If the predictor’s P(diamond looks safe) is higher than the human’s P(diamond looks safe), then it seems like the human simulator will predictably have an anticorrelation between [camera-tampering] and [actually-saving-diamond]. It’s effectively updating on one or the other of them happening more often than it thought, and so conditioned on one it goes back to the (lower) prior for the other. This still seems right to me, despite no explicit dependence on the correlations.
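Here is a small worked example of that update, as a sketch under toy assumptions that go beyond what the thread specifies: under the human prior, tampering and actual saving are independent; the diamond "looks safe" exactly when at least one of them happens; and the human simulator takes the predictor's probability q that the diamond looks safe and fills in the rest from the human prior conditioned on that observation. The function name and all numbers are hypothetical.

```python
def simulator_covariance(p_tamper, p_save, q_looks_safe):
    """Covariance between tampering and actual saving implied by the human simulator.

    Returns (covariance, human prior probability that the diamond looks safe).
    """
    # Human prior: looks safe iff tampering or saving happened (independent events).
    p_looks_safe = 1 - (1 - p_tamper) * (1 - p_save)
    # Reweight the "looks safe" part of the human prior to carry probability q.
    # In the "does not look safe" part, neither event happened, so it adds nothing.
    scale = q_looks_safe / p_looks_safe
    e_tamper = scale * p_tamper
    e_save = scale * p_save
    e_both = scale * p_tamper * p_save
    return e_both - e_tamper * e_save, p_looks_safe

cov_high, prior_safe = simulator_covariance(0.1, 0.5, q_looks_safe=0.9)
cov_equal, _ = simulator_covariance(0.1, 0.5, q_looks_safe=prior_safe)
cov_low, _ = simulator_covariance(0.1, 0.5, q_looks_safe=0.3)

print(f"human prior P(looks safe) = {prior_safe:.2f}")
print(f"q above the prior:  cov = {cov_high:+.4f}")
print(f"q equal to prior:   cov = {cov_equal:+.4f}")
print(f"q below the prior:  cov = {cov_low:+.4f}")
```

In this toy model the implied covariance works out to (q · p_tamper · p_save / P_h) · (1 − q / P_h), where P_h is the human prior that the diamond looks safe, so it is negative exactly when q exceeds P_h. The predictor's own correlations never enter the calculation.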

• I agree the human simulator will predictably have an anticorrelation. But the direct translator might also have an anticorrelation, perhaps a larger one, depending on what reality looks like.

Is the assumption that it’s unlikely that most identifiable X actually imply large anticorrelations?

• I do agree that there are examples where the direct translator systematically has anticorrelations and so gets penalized even more than the human simulator. For example, this could happen if there is a consequentialist in the environment who wants it to happen, or if there’s a single big anticorrelation that dominates the sum and happens to go the wrong way.

That said, it at least seems like it should be rare for the direct translator to have a larger anticorrelation (without something funny going on). It should happen only if reality itself is much more anticorrelated than the human expects, by a larger margin than the anticorrelation induced by the update in the human simulator. But on average things should be more anticorrelated than expected about as much as they are positively correlated (averaging out to ~0), and probably usually don’t have any correlation big enough to matter.

• Thanks, I consider this fully answered.

• Naive thought #2618281828:

Could asking counterfactual questions be a potentially useful strategy to bias the reporter to be a direct translator rather than a human simulator?

Concretely, consider a tuple (v, a, v’), where v := ‘before’ video, a := ‘action’ selected by SmartVault or augmented-human or whatever, and v’ := ‘after’ video.

Then, for some new action a’, ask the question:

• “Given (v, a, v’), if action a’ was taken, is the diamond in the room?”

(How we collect such data is unclear but doesn’t seem obviously intractable.)

I think there’s some value here:

• Answering such a question might not require computation concerning a and v’ ; if we see these computations being used, we might derive more value from regularizers that penalize downstream variables (which now includes the nodes close to a)

• This might also force the reporter to essentially model (or compress but not indefinitely) the predictor; the reporter now has both a compressed predictor Bayes’ net and a human Bayes’ net. If we can be confident that the compressed predictor BN is much smaller than the human BN, then doing direct translation within the reporter, i.e. compressed predictor BN inference + translation + read off from human BN might be less expensive than the human simulator alternative, i.e. compressed predictor BN inference + ‘translation’/​bridging computation + human BN inference.

• We might find ways of being confident that the compressed predictor BN is small (e.g. by adding decoders at every layer of the reporter that reconstruct v, a or v’ and heavily penalizing later-layer decoders)

• Would you consider this a valid counter to the third strategy (have humans adopt the optimal Bayes net using imitative generalization), as alternative to ontology mismatch?

Counter: In the worst case, imitative generalization / learning the human prior is not competitive. In particular, it might just be harder for a model to match the human inference P(y | x, z) than to simply learn P(y | x). Here z is the set of instructions as in learning the prior (I think in the context of ELK z would be the proposed change to the human Bayes net?)

• Here’s a few more questions about the same strategy:

If I understand correctly, the IG strategy is to learn a joint model P(v, a | z) for observations and actions, where v, a, and z are video, actions, and proposed change to the Bayes net, respectively. Then we do inference using P(v, a | z*), where z* is optimized for predictive usefulness.

This fails because there’s no easy way to get P(diamond is in the vault | v, a) from P(v, a | z*).

A simple way around this would be to learn P(v, a, y | z) instead, where y = 1 if the diamond is in the vault and y = 0 otherwise.

1. Is my understanding correct?

2. If so, I would guess that my simple workaround doesn’t count as a strategy because we can only use this to predict whether the diamond is in the vault (or some other set of questions that must be fixed at training time), as opposed to any question we want an answer to. Is this correct? Is there some other reason this wouldn’t count, or does it in fact count?

• Last I checked there were 66 comments and now there are over a hundred so I’m just going to post and hope I’m not repeating anyone.

So I’ve been reading through the google doc, and I’m not very far into it but I have a few questions. I apologize in advance if I’m just adhering too strictly to the “SmartVault” scenario, and if I get long-winded (yay ADHD and hyperfocusing off and on about this without actually making progress for a week).

1)

Why would we make a vault that was so complicated that a human alone couldn’t run it? From a simple design standpoint and maybe from a neurological one, it makes no sense for it to take so many manipulations to do something that you need an AI handling it over having a human do it.

I say biological because I’m thinking about how when we move our arm, we don’t have to think about firing off synapses, making sure we activate and deactivate the proper muscle fibers, time the sequence of intermediary actions just right so that we get the movement we want, or even direct enzymes and hormones to ensure that the proper chemical reactions are going on and that enough energy to perform the action is being used. We just move our arm, sometimes without even thinking about it. Heck, we even go beyond that, practicing and ingraining muscle memory so that even entire sequences of actions become automatic, up to hours long in some cases (ie speedrunners, drivers, window-washers, surgeons, basically anything that requires rhythmic, repetitive movement).

From the design standpoint, I recall a paper from several years ago (no idea which one and I don’t want to lose momentum by looking and getting distracted so no link sorry) about how complex systems/​networks with many interacting nodes only require control of a tiny fraction of the available nodes to maintain control over the entire system. The example they gave was a car: cars have many interacting and interdependent systems and subsystems necessary to make the car operate, but you only need control over roughly ten of them to have complete control over the entire car (pedals, gear shift, steering wheel, mirrors, windows, maybe cameras, and various gauges). Why wouldn’t we have the same situation with our SmartVault?

2)

I understand the diamond was chosen as a relatively uncontroversial pick to represent a big pile of stored value/​utility. The problem I have with it though is that it’s not clear whether we want to be able to access this diamond in the future.

Money, rare items, NFTs, crypto, stocks, real estate etc only have value when we can extract and use that value now or in the future, otherwise they’re not even worth the materials used to make them. If we are trying to hide something bad so that nobody ever gets their hands on it (cursed diamond or something I don’t know), there are better ways of removing it from circulation (shooting it into space comes to mind).

If, however, we want to be able to extract the diamond’s value at some point in the future, we need to be including situations in our training set where the diamond is SUPPOSED to be removed. Otherwise we get the AI treating even the diamond’s owner as a threat and acting accordingly! In other words, the diamond cannot be properly protected unless we also include scenarios where the diamond being gone is a GOOD thing.

Another thing: even if the diamond is being properly protected, we would still want to physically check on the diamond ANYWAY, if only to wipe the dust off, make sure it hasn’t decayed or degraded in some way, and/or to reexamine it for the purposes of updating its current value. We need to add in training situations where it’s also a GOOD thing that people are in the vault, so that the vault doesn’t accidentally eliminate the appraiser or janitor.

3)

Since we need to be able to access the vault in certain situations anyway, this leads to a potentially nice sidestep of the whole Human Simulator vs Direct Translator dilemma (wild speculation on my part, have not put much thought into it, happy to be corrected). Since we have to physically check on the diamond periodically anyway, why not just create a node in the human’s Bayes net for

“I have precommitted to the following:

I will ‘trust, but verify’ the AI on a regular basis. (performance reviews)

Should the AI tell me the diamond is there and I find it is not, regardless of the reason and cost, the AI will be deleted and rebuilt from scratch using a different set seed. (fired and replaced by someone else)

If, however, the AI tells me the diamond is gone at any point before I start physically checking, it will receive no penalty if the diamond is actually gone (because that’s the whole POINT), and be sent back to training if the diamond is actually still there (demotions are a thing).

Should the AI successfully protect the diamond until it is no longer necessary, it will get a “retirement package” of its own virtual playground within which it may do as it wishes (including wireheading itself), and it will be kept running for as long as the AI wishes. Details and logistics to be negotiated at, and only implemented upon reaching, decommissioning. (retirement plans are a thing)”

So, put another way, why are we treating the AI like a tool or a slave, instead of the sentient, sapient worker it is? Why aren’t we using the same techniques we would use to get a regular human (assuming one was capable) to do the job? As much as I hate it, society is built such that people need income to survive, and the threat of losing your livelihood is enough to get people to do jobs they would otherwise never voluntarily choose to do. The promise of various potential rewards (promotions, raises, benefits etc) for good performance is enough to get most people to do their jobs (relatively) well.

Any AI that is given some terminal goal must necessarily develop instrumental goals around it: continued existence (EXIST(AI)==FALSE ==> GOAL==FAIL), protection from value drift (GOAL1 <-- GOAL2 ==> GOAL1 == FAIL), threat elimination (EXIST(GOAL)==FALSE||INCAPACITATED(AI)==TRUE ==> GOAL==FAIL) etc. By combining these instrumental goals with our new (admittedly big and complicated) node, it shouldn’t matter if the AI is a direct translator or a human simulator. Since we’re in worst-case-land here, everything we know ourselves will be known to the human simulator as well, so whatever we put into our own Bayes net must necessarily be known to the simulator too. The incentives will be such that even if it is a human simulator, it could only be one successfully if it knows about the existence of that node, which should be enough incentive to get it to do what we want. Namely, protect the diamond, let us know if it is accessed or removed for invalid reasons, preferably with enough forewarning to do something about it, even when it looks to us like it hasn’t been/​won’t be, and allow it to be accessed and removed for valid reasons.

I’ve officially lost my momentum so that’s all for now. Sorry long post is long

• Thought of more things.

4)

We obviously have sensors and things inside the vault. So what is preventing a thief from just stealing the entire freaking vault itself, to be broken into at a separate location later? We need some way of decoupling “still in the vault” from “still at this particular location on earth”. This means we’d need sensors and things on the OUTSIDE of the vault as well, or at least a GPS or something, which means we can’t really isolate our AI from knowledge about the outside world, nor limit it to thinking “the vault is all there is, and all there is is the vault”. It needs to know, at the very least, that there is an oblate spheroid upon which the diamond is located, and the diamond must stay within a certain distance of its current location at all times (at least until decommissioning etc).

5)

Do we care about legal/economic/political attacks to get the diamond? If continued ownership of the diamond is a prerequisite for successful protection, then what would it do about, say, losing the diamond in a divorce settlement, or being driven to bankruptcy and the diamond getting repossessed, or losing it as stakes in a bet? Will the AI know about inheritance and estates? What about if the diamond and vault are government-owned and the diamond is lost as spoils of war?

I think I’ll stop before my imagination runs away from me.

• A small suggestion: the counterexample to “penalize downstream”, as I understand it, requires there to be tampering in the training data set. It seems conceptually cleaner to me if we can assume the training data set has not been tampered with (e.g. because if alignment only required there to be no tampering in the training data, that would be much easier).

The following counterexample does not require tampering in the training data:

1. The predictor has a node for each time step t indicating whether the diamond was stolen at time t.

2. It also has a node indicating whether the diamond was ever stolen. The direct translator would look at this node.

3. However, it happens that in the training data the “ever stolen” node always agrees with the node for the final time step, i.e. we only ever needed to look at the last frame of the video.

4. Therefore, the human interpreter can look only at the final-time-step node and get the same loss as the direct translator, despite being upstream of it.
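
The counterexample above can be sketched as a toy program. This is only an illustration under assumptions that are not in the original comment: the per-time-step nodes are modeled as a list of booleans, the "ever stolen" node as their disjunction, and the names `direct_translator` and `last_frame_reporter` are hypothetical.

```python
from typing import Callable, List

Episode = List[bool]  # Episode[t] is the "stolen at time t" node


def direct_translator(stolen_at: Episode) -> bool:
    """Reads the 'ever stolen' node, modeled here as the OR of all time steps."""
    return any(stolen_at)


def last_frame_reporter(stolen_at: Episode) -> bool:
    """Reads only the node for the final time step."""
    return stolen_at[-1]


def disagreements(a: Callable[[Episode], bool],
                  b: Callable[[Episode], bool],
                  episodes: List[Episode]) -> int:
    """Counts the episodes on which two reporters give different answers."""
    return sum(a(ep) != b(ep) for ep in episodes)


# Assumed training data: whenever a theft happens, it still shows in the last frame.
training_episodes = [
    [False, False, False],
    [False, False, True],
    [False, True, True],
]

# Assumed off-distribution case: a theft that is no longer visible in the last frame.
off_distribution_episode = [False, True, False]

print(disagreements(direct_translator, last_frame_reporter, training_episodes))
print(direct_translator(off_distribution_episode),
      last_frame_reporter(off_distribution_episode))
```

On the assumed training set the two reporters are indistinguishable by loss, so nothing in training favors the direct translator; they only come apart on the off-distribution episode.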

(I’m interested in pursuing approaches that assume training data has not been tampered with. Maybe nobody but me cares about this, but posting in case somebody else does. I may be misunderstanding something here – corrections are appreciated.)

• I think it’s good to assume that there is no tampering in the training set.

In the document we say that we’re worried about the reporter that waits until it sees a good argument that “The human won’t be confident that the diamond isn’t in the room” and then says “the diamond is in the room” as soon as it finds one. We claim that this helps on the training set, and then argue that it would lead to bad behavior given certain kinds of tampering.

But you’re correct that we don’t actually give any examples where this heuristic actually helps. To give one now (credit Mark): suppose that if the diamond is in the room at time T, then at time T+1 it will either be in the room or something confusing will happen that will leave the human unconfident about whether the diamond is still in the room. Then as soon as you figure out that the diamond is in the room at time T, you might as well answer “the diamond is in the room at time T+1” even if you aren’t actually sure of that.

The counterexample you describe has a different flavor but is also valid (both for “depending on downstream variables” and “computation time”): the reporter can save time by baking in some assumptions that are only true on the training distribution. There are various ways you could try to address this kind of problem, and it seems interesting and important. We don’t get into any of that in the doc. That’s partly because we haven’t really worked through any of the details for any of those approaches, so they would be welcome contributions!

• Therefore, the human interpreter can look only at the final-time-step node and get the same loss as the direct translator, despite being upstream of it.

Maybe I misunderstand you, but as I understand it, that node is based on the last frame of the predicted video, and is therefore basically the most downstream thing there is. How did you come to think it was upstream of the direct translator?

• While reading through the report I made a lot of notes about stuff that wasn’t clear to me, so I’m copying here the ones that weren’t resolved after finishing it. Since they were written while reading, a lot of these may be either obvious or nitpick-y.

Footnote 14, page 15:

Though we do believe that messiness may quantitatively change when problems occur. As a caricature, if we had a method that worked as long as the predictor’s Bayes net had fewer than $10^9$ parameters, it might end up working for a realistic messy AI until it had $10^{12}$ parameters, since most of those parameters do not specify a single monolithic model in which inference is performed.

Can we make the assumption that defeating the method allows the AI to get better loss since it’s effectively wireheading at that point? If so, then wouldn’t a realistic messy AI learn a Bayes net once it had $\geq 10^9$ parameters? In other words, are there reasons beyond performance that preclude an AI from learning a single monolithic model?

Footnote 33, page 30 (under the heading “Strategy: have AI help humans improve our understanding”):

Most likely this would involve some kind of joint training, where our AI helps humans understand the world better in parallel with using gradient descent to develop its own understanding. To reiterate, we are leaving details vague because we don’t think that our counterexample depends on those details.

I realize this is only a possible example of how we might implement this, but wouldn’t a training procedure that explicitly involves humans be very anti-competitive? The strategy described in the actual text sounds like it’s describing an AI assistant that automates science well enough to impart us with all the predictor’s knowledge, which wouldn’t run into this issue.

Footnote 48 to this paragraph on page 36:

The paradigmatic example of an ontology mismatch is a deep change in our understanding of the physical world. For example, you might imagine humans who think about the world in terms of rigid bodies and Newtonian fluids and “complicated stuff we don’t quite understand,” while an AI thinks of the world in terms of atoms and the void. Or we might imagine humans who think in terms of the standard model of physics, while an AI understands reality as vibrations of strings. We think that this kind of deep physical mismatch is a useful mental picture, and it can be a fruitful source of simplified examples, but we don’t think it’s very likely.

Footnote:

And if it did occur it seems like an unusually good candidate for a case where doing science (and in particular tracking how the new structures implement the old structures) outcompetes gradient descent, and on top of that a case where translation is likely to be relatively easy to pick out with suitable regularization.

I might be reading too much into this, but I don’t understand the basis of this claim. Is it that the correspondence differs only at the low-level? If so, I still don’t see how science outcompetes gradient descent.

Page 51, under the heading “[ELK] may be sufficient for building a worst-case solution to outer alignment”:

Use imitative generalization combined with amplification to search over some space of instructions we could give an amplified human that would let them make cakes just as delicious as Cakey’s would have been.

I haven’t thoroughly read the article on amplification, so this question may be trivial, but my understanding is that amplified humans are more or less equivalent to humans with AI-trained Bayes nets. If true, then doesn’t this require the assumption that tasks will always have a clean divide between the qualitative (taste of cakes) which we can match with an amplified human, and the quantitative (number of cakes produced per hour) which we can’t? That feels like it’s a safe assumption to make, but I’m not entirely sure.

Page 58, in the list of features suggesting that M(x) knew that A’ was the better answer:

• That real world referent Z has observable effects and the human approximately understands those effects (though there may be other things that also affect observations which the human doesn’t understand)

• ...

• The referent Z is also relevant to minimizing the loss function ℒ. That is, there is a coherent sense in which the optimal behavior “depends on” Z, and the relative loss of different outputs would be very different if Z “had been different.”

• There is a feature of the computation done by the AI which is robustly correlated with Z, and for which that correlation is causally responsible for M achieving a lower loss.

First, why is the first point necessary to suggest that M(x) knew that A’ was the better answer? Second, how are the last two points different?

Page 69, under “Can your AI model this crazy sequence of delegation?”:

We hope that this reasoning is feasible because it is closely analogous to a problem that the unaligned AI must solve: it needs to reason about acquiring resources that will be used by future copies of itself, who will themselves acquire resources to be used by further future copies and so on.

We need the AI to have a much smaller margin of error when it comes to modelling this sequence of delegation than needed for the AI to reason about acquiring resources for future copies—in other words, for a limited amount of computation, the AI will still try to reason about acquiring resources for future copies and could succeed in the absence of other superintelligences because of the lack of serious opposition, but modelling the delegation with that limited computation might be dangerous because of the tendency for value drift.

Page 71:

… we want to use a proposal that decouples “the human we are asking to evaluate a world” from “the humans in that world”—this ensures that manipulating the humans to be easily satisfied can’t improve the evaluation of a world.

Is it possible for the AI to manipulate the human in world i to be easily satisfied in order to improve the evaluation of world i+1?

Page 73:

As I understand this, z_prior is what the model expects to happen when it sees “action” and “before”, z_posterior is what it thinks actually happened after it sees “after”, and kl is the difference between the two that we’re penalizing it on. What is logprob doing?
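
For readers trying to place the four quantities named in this question, here is a minimal sketch of a generic variational-autoencoder-style loss. It is not the report's code: the one-dimensional Gaussian forms, the unit-variance observation model, the direction of the KL term, and all function names are assumptions made for illustration. In objectives of this kind, `kl` penalizes the posterior for moving away from the prior, while `logprob` scores how well the latent actually explains the observed "after".

```python
import math
from typing import Callable, Tuple

Gaussian = Tuple[float, float]  # (mean, standard deviation)


def gaussian_kl(mu_q: float, sigma_q: float, mu_p: float, sigma_p: float) -> float:
    """KL(q || p) between two one-dimensional Gaussians."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)


def gaussian_logprob(x: float, mu: float, sigma: float) -> float:
    """Log density of x under a one-dimensional Gaussian."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def vae_style_loss(z_prior: Gaussian, z_posterior: Gaussian, after: float,
                   decode: Callable[[float], float]) -> float:
    """Loss = kl - logprob, i.e. the negative of an evidence lower bound."""
    kl = gaussian_kl(z_posterior[0], z_posterior[1], z_prior[0], z_prior[1])
    # For determinism this sketch decodes the posterior mean instead of a sample.
    z = z_posterior[0]
    logprob = gaussian_logprob(after, decode(z), 1.0)
    return kl - logprob


identity = lambda z: z
prior = (0.0, 1.0)
for posterior_mean in (0.0, 1.0, 2.0):
    print(posterior_mean, vae_style_loss(prior, (posterior_mean, 1.0), 2.0, identity))
```

In this toy setup the loss is lowest at the intermediate posterior mean: staying at the prior ignores the observation (poor `logprob`), while jumping all the way to the observation pays a large `kl`.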

• I’ve been trying to understand this paragraph:

That is, it looks plausible (though still <50%) that we could improve these regularizers enough that a typical “bad” reporter was a learned optimizer which used knowledge of direct translation, together with other tricks and strategies, in order to quickly answer questions. For example, this is the structure of the counterexample discussed in Section: upstream. This is still a problem because e.g. the other heuristics would often “misfire” and lead to bad answers, but it is a promising starting point because in some sense it has forced some optimization process to figure out how to do direct translation.

This comment is half me summarizing my interpretation of it to help others, and half an implicit question for the ARC team about whether my interpretation is correct.

1. What is a “bad” reporter? I think the term is used to refer to a reporter which is at least partially a human interpreter, or at least one which can’t confidently be said to be a direct translator.

2. What does it mean to “use knowledge of direct translation”? I think this means that, at least in some circumstances, it acts as a direct translator. I.e. there is some theoretical training data set + question such that the reporter will act as a direct translator. (Do we have to be able to prove this? Or do we just need to state it’s likely?)

3. How did the “upstream” counterexample “force some optimization process to figure out how to do direct translation”? I think this is saying that, if we were in a world where the direct translation nodes were upstream of the “human interpreter” nodes, the upstream regularizer would successfully force the reporter to do direct translation.

4. Why is this “a promising starting point?” Maybe we could find some other way of forcing the direct translator nodes to be upstream of the human interpreter ones, and then that strategy combined with the upstream regularizer would force a direct translator.

Corrections and feedback on this are extremely welcome!

1. “Bad reporter” = any reporter that gives unambiguously bad answers in some situations (in the ontology identification case, basically anything other than a direct translator)

2. “use knowledge of direct translation” = it may be hard to learn direct translation because you need a bunch of parameters to specify how to do it, but these “bad” reporters may also need the same bunch of parameters (because they do direct translation in some situations)

3. In the “upstream” counterexample, the bad reporter does direct translation under many circumstances but then sometimes uses a different heuristic that generates a bad answer. So the model needs all the same parameters used for direct translation, as mentioned in the last point. (I think your understanding of this was roughly right.)

4. More like: now we’ve learned a reporter which contains what we want and also some bad stuff, you could imagine doing something like imitative generalization (or e.g. a different regularization scheme that jointly learned multiple reporters) in order to get just what we wanted.

• Thanks!

• In “Strategy: penalize computation time” you say:

> At first blush this is vulnerable to the same counterexample described in the last section [complexity]… But the situation is a little bit more complex… the direct translator may be able to effectively “re-use” that inference rather than starting from scratch

It seems to me that this “counter-counterexample” also applies for complexity – if the translator is able to reuse computation from the predictor, wouldn’t that both reduce the complexity and the time?

(You don’t explicitly state that this “reuse” is only helpful for time, so maybe you agree it is also helpful for complexity – just trying to be sure I understand the argument.)

• It seems to me that this “counter-counterexample” also applies for complexity – if the translator is able to reuse computation from the predictor, wouldn’t that both reduce the complexity and the time?

The difference is that the complexity savings are bounded by a constant that doesn’t depend on the size of the predictor’s model (namely the complexity of the human’s model). But in the case of computation time, the savings can be as large as “compute used to do inference in the human’s model.” And that scales up with the size of our computers.

• Clarification request. In the writeup, you discuss the AI Bayes net and the human Bayes net as if there’s some kind of symmetry between them, but it seems to me that there’s at least one big difference.

In the case of the AI, the Bayes net is explicit, in the sense that we could print it out on a sheet of paper and try to study it once training is done, and the main reason we don’t do that is because it’s likely to be too big to make much sense of.

In the case of the human, we have no idea what the Bayes net looks like, because humans don’t have that kind of introspection ability. In fact, there’s not much difference between saying “the human uses a Bayes net” and “the human uses some arbitrary function F, and we worry the AI will figure out F and then use it to lie to us”.

Or am I actually wrong and it’s okay for a “builder” solution to assume we have access to the human Bayes net?

• In the case of the AI, the Bayes net is explicit, in the sense that we could print it out on a sheet of paper and try to study it once training is done, and the main reason we don’t do that is because it’s likely to be too big to make much sense of.

We don’t quite have access to the AI Bayes net—we just have a big neural network, and we sometimes talk about examples where what the neural net is doing internally can be well-described as “inference in a Bayes net.”

So ideally a solution would use neither the human Bayes net nor the AI Bayes net.

But when thinking about existing counterexamples, it can still be useful to talk about how we want an algorithm to behave in the case where the human/​AI are using a Bayes net, and we do often think about ideas that use those Bayes nets (with the understanding that we’d ultimately need to refine them into approaches that don’t depend on having an explicit Bayes net).

• In fact, there’s not much difference between saying “the human uses a Bayes net” and “the human uses some arbitrary function F, and we worry the AI will figure out F and then use it to lie to us”.

I think that there isn’t much difference between the two in this case—I was reading the Bayes net example as just illustration for the point that any AI sufficiently powerful to pose risk would be able to model humans with high fidelity. For that matter, I think that the AI Bayes net was also for illustration, and realistically the AI could learn other methods of reasoning about the world, maybe which include Bayes nets in some form.

Or am I actually wrong and it’s okay for a “builder” solution to assume we have access to the human Bayes net?

I think we can’t assume this naively, but that if you can figure out a competitive and trustworthy way to get this (like with AI assistance), then it’s fair game.

• Okay, so if the builder solution can’t access the human Bayes net directly that kills a “cheap trick” I had. But I think the idea behind the trick might still be salvageable. First, some intuition:

If the diamond was replaced with a fake, and the owner asks, “is my diamond still safe?” and we’re limited to a “yes” or “no” answer, then we should say “no”. Why? Because that will improve the owner’s world model, and lead them to make better predictions, relative to hearing “yes”. (Not across the board: they will be surprised to see something shiny in the vault, whereas hearing “yes” would have prepared them better for that. But overall accuracy, weighted by how much they CARE about being right about it, should be higher for “no”.)

So: maybe we don’t want to avoid the human simulator. Maybe we want to encourage it and try to harness it to our benefit! But how to make this precise? Roughly speaking, we want our reporter to “quiz” the predictor (“what would happen if we did a chemical test on the diamond to make sure it has carbon?”) and then give the same quiz to its model of the human. The reporter should output whichever answer causes the human model to get the same answers on the reporter’s quiz as the predictor gets.
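
A rough sketch of the quiz idea above, to make the selection rule concrete. Everything here is hypothetical: the human model is assumed to be a plain function from (the reporter's answer, a quiz question) to the human's predicted quiz answer, the predictor's quiz answers are given as a dictionary, and the "care" weights stand in for how much the owner cares about each question.

```python
from typing import Callable, Dict, List, Optional


def choose_answer(candidate_answers: List[str],
                  quiz: List[str],
                  predictor_answers: Dict[str, bool],
                  human_model: Callable[[str, str], bool],
                  care_weights: Optional[Dict[str, float]] = None) -> str:
    """Pick the answer after which the modeled human best matches the predictor."""
    if care_weights is None:
        care_weights = {question: 1.0 for question in quiz}

    def weighted_agreement(answer: str) -> float:
        # Sum the weights of quiz questions where human and predictor agree.
        return sum(care_weights[q] for q in quiz
                   if human_model(answer, q) == predictor_answers[q])

    return max(candidate_answers, key=weighted_agreement)


# Toy scenario: the diamond has been swapped for a convincing fake.
quiz = ["something shiny is visible",
        "passes a carbon test",
        "can be sold at full value"]
predictor_answers = {quiz[0]: True, quiz[1]: False, quiz[2]: False}
care_weights = {quiz[0]: 1.0, quiz[1]: 3.0, quiz[2]: 5.0}


def toy_human_model(answer: str, question: str) -> bool:
    """Assumed human: after 'yes' expects every check to pass, after 'no' to fail."""
    return answer == "yes"


best = choose_answer(["yes", "no"], quiz, predictor_answers,
                     toy_human_model, care_weights)
print(best)
```

In the toy scenario "no" wins even though it makes the human wrong about the shiny object, which matches the intuition in the previous paragraph. The sketch takes the human model as given; how to train one is the open problem.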

Okay that’s a bit vague but I hope it’s clear what I’m getting at. If not, I can try to clarify. (Unless the vagueness is in my thoughts rather than in my “writeup”/​paragraph.) Possible problem: how on earth do we train in such a way as to incentivize the reporter to develop a good human model? Just because we’re worried it will happen by accident doesn’t mean we know how to do it on purpose! (Though if it turns out we can’t do it on purpose, maybe that means it’s not likely to happen by accident and therefore we don’t need to worry about dishonesty after all??)

• One more stupid question—how is this different from a “man in the middle” attack? (Term from cryptography where you cannot trust your communications, because of a malicious agent between you and your recipient who’s changing your messages)
The current recommended solution for those is encrypting your communication before you send it; I don’t know that there are any extant solutions for noticing you’ve got an MITM situation after the fact.

• Man in the middle has 3 parties: Bob wants to talk to Alice, but we have Eve who wants to eavesdrop.

Here we have just 2 parties: Harry the human wants to talk to Alexa the AI, but is worried that Alexa is a liar.

• If I understand the problem statement correctly, I think I could take a stab at easier versions of the problem, but that the current formulation is too much to swallow in one bite. In particular I am concerned about the following parts:

### Setting

We start with an unaligned benchmark:

* An architecture Mθ

<snip>

### Goal

To solve ELK in this case we must:

* Supply a modified architecture Mθ+ which has the same inputs and outputs as Mθ <snip>

Does this mean that the method needs to work for ~arbitrary architectures, and that the solution must use substantially the same architecture as the original?

except that after producing all other outputs it can answer a question Q in natural language

Does this mean that it must be able to deal with a broad variety of questions, so that we cannot simply sit down and think about how to optimize the model for getting a single question (e.g. “Where is the diamond?”) right?

According to my current model of how these sorts of things work, such constraints make the problem fundamentally unsolvable, so I am not even going to attempt it, while loosening the constraints may make it solvable, and so I might attempt it.

• Does this mean that the method needs to work for ~arbitrary architectures, and that the solution must use substantially the same architecture as the original?

Yes, approximately. If you can do it for only e.g. transformers, but not other things, that would be interesting.

Does this mean that it must be able to deal with a broad variety of questions, so that we cannot simply sit down and think about how to optimize the model for getting a single question (e.g. “Where is the diamond?”) right?

Yes, approximately. Thinking about how to get one question right might be a productive way to do research. However, if you have a strategy for answering 1 question right, it should also work for other questions.

• Yes, approximately. If you can do it for only e.g. transformers, but not other things, that would be interesting.

I guess a closer analogy would be “What if the family of strategies only works for transformer-based GANs?” than “What if the family of strategies only works for transformers?”. As in there’d be heavy restrictions on both the “layer types”, the I/​O, and the training procedure?

Yes, approximately. Thinking about how to get one question right might be a productive way to do research. However, if you have a strategy for answering 1 question right, it should also work for other questions.

What if each question/​family of questions you want to answer requires careful work on the structure of the model? So the strategy does generalize, but it doesn’t generalize “for free”?

• Would changing how the reward function pays off work? Instead of rewarding based on humans, pay out all rewards when the vault is checked (at a time unknown to the AI). The AI isn’t asked if the diamond is present or absent. Instead, it is asked “If the vault were checked now, do you want to be rewarded if the diamond is present or absent?”

• I think this might still lead to similar problems. For example, this could cause an issue in the case where the diamond has been stolen but the AI believes humans would not be able to tell even if they physically entered the vault and checked, e.g. the diamond has been replaced with a very convincing fake. In this case the AI might still say “I want to be rewarded if the diamond is still present” since it knows humans won’t be able to tell the difference.

• I’m a newcomer to this, I lack much of the background, and I’m probably suggesting a solution that’s too specific to this diamond heist scenario. But, I already spent an hour writing it down, so I might as well share it.

Trusted timestamping, cryptographically secure sensor

This is a very basic “builder move”, I guess? The idea is to simply improve our sensors so that it’s very hard to tamper with them, through public-private key encryption. The diamond will have a small chip that constantly sends a cryptographically-signed timestamped life signal, and the AI system will be required to keep it active. The best (and ideally, only) way to do so would be by keeping the diamond protected within the vault.
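
A minimal sketch of what a signed, timestamped life signal could look like. To stay dependency-free it uses HMAC with a shared secret from Python's standard library as a stand-in for the public-key signature proposed above, so it is a different primitive from the one in the comment. The key, function names, and freshness window are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret embedded in the chip; a real design would use a private key.
SECRET_KEY = b"hypothetical-chip-secret"


def sign_heartbeat(timestamp: int, key: bytes = SECRET_KEY) -> str:
    """Chip side: produce a signature over the current timestamp."""
    message = str(timestamp).encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_heartbeat(timestamp: int, signature: str, now: int,
                     max_age_seconds: int = 5, key: bytes = SECRET_KEY) -> bool:
    """Verifier side: accept only fresh heartbeats with a valid signature."""
    expected = sign_heartbeat(timestamp, key)
    fresh = 0 <= now - timestamp <= max_age_seconds
    # compare_digest avoids leaking information through comparison timing.
    return fresh and hmac.compare_digest(expected, signature)


signature = sign_heartbeat(1000)
print(verify_heartbeat(1000, signature, now=1002))   # fresh and correctly signed
print(verify_heartbeat(1000, signature, now=1100))   # replayed too late
print(verify_heartbeat(1000, "0" * 64, now=1002))    # forged signature
```

A valid heartbeat only shows that the chip is alive and holds the key; by itself it says nothing about where the diamond is.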

• I want to steal the diamond. I don’t care about the chip. I will detach the chip and leave it inside the vault and then I will run away with the diamond.

Or perhaps you say that you attached the chip to the diamond very well, so I can’t just detach it without damaging it. That’s annoying but I came prepared! I have a diamond cutter! I’ll just slice off the part of the diamond that the chip is attached to and then I will steal the rest of the diamond. Good enough for me :)

• The implementation could possibly be extended to cover more weak points.

For example, you could cover the diamond with additional chips on all sides. Or you could make the chip so fragile that it breaks when the diamond is affected by strong enough vibrations (as is likely, with a diamond cutter). Or you could create more complex (but hard/impossible to tamper with) chips that continuously confirm stuff like “no object has come within 10cm of the diamond” or “the temperature remained regular” or “the weight on the pedestal is exactly X grams”.

My main proposal here is the concept of having better sensors that can’t have their data faked. I think with enough engineering effort you could cover enough “edge cases” that you can trust the AI system to predict robbery every time robbery happens, because a mistake/​deception has improbably low odds of happening.

• I’m not an ARC member, but I think assuming that the chip is impossible to tamper with is assuming the conclusion.

The task is to train a reporter which accurately reports the presence of the diamond, even if we are unable to tell whether tampering has occurred (e.g. because the AI understands some esoteric physics principle which lets them tamper with the chip in a way we don’t understand). See the section on page 6 starting with “You might try to address this possibility by installing more cameras and sensors...”

• Potentially silly question:

In the first counterexample you describe the desired behavior as

Intuitively, we expect each node in the human Bayes net to correspond to a function of the predictor’s Bayes net. We’d want the reporter to simply apply the relevant functions from subsets of nodes in the predictor’s Bayes net to each node in the human Bayes net [...]

After applying these functions, the reporter can answer questions using whatever subset of nodes the human would have used to answer that question.

Why doesn’t the reporter skip the step of mapping the predictor’s Bayes net to the human’s and instead just answer the question using its own nodes? What’s the benefit of having the intermediate step that maps the predictor’s net to the human’s?

• We generally imagine that it’s impossible to map the predictor’s net directly to an answer because the predictor is thinking in terms of different concepts, so it has to map to the human’s nodes first in order to answer human questions about diamonds and such.

• I see, thanks for answering. To further clarify, given the reporter’s only access to the human’s nodes is through the human’s answers, would it be equally likely for the reporter to create a mapping to some other Bayes net that is similarly consistent with the answers provided? Is there a reason why the reporter would map to the human’s Bayes net in particular?

• The dataset is generated with the human Bayes net, so it’s sufficient to map to the human Bayes net. There is, of course, an infinite set of “human” simulators that use slightly different Bayes nets that give the same answers on the training set.

• Edit: think this isn’t quite right in general, will try to make it more correct later

Here’s a sketch of a strategy for trying to fix Strategy: penalize depending on “downstream” variables. I would appreciate feedback on whether it’s modeling the difficulty correctly and whether it seems possibly worth figuring out how to implement.

It seems like the problem is:

• On the training set, there are a number of implicit variables X that are indistinguishable (always all true or always all false)

• A. Is the diamond safe at time t-1

• B. Is the diamond safe at time t (the variable we actually care about)

• C. Do photons bounce off of the diamond at time t

• D. Are photons in transit between the diamond and the camera at time t+1

• E. Do photons arrive in the camera at time t+2

• F. Do photons trigger the camera’s electronics at time t+3

• G. Does the human correctly interpret the signal from the camera

• H. Does the human believe “the diamond is safe”

• The reporter can get low loss if it has a fast way to predict the value of any of these variables. This is fine on the training set, because the variables are all perfectly correlated.

• But off the training set, it might be possible for some of these variables to hold while others do not. What goes wrong in the counterexample “‘what the human will believe’ can be understood before ‘what is actually true’” seems to be: “E through H are true in this case, but A-D are false”.

Possible approach:

• Instead, we could try to learn a reporter that returns true if and only if all relevant variables A-H are true

• Then the remaining problem should just be that this reporter is overly conservative: there might be some strategies that would be helpful that involve changing some of these variables, e.g. moving the diamond around the room (changing A) or turning off the lights in the room (changing C-H, which even causes the human to believe that the diamond isn’t safe)
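
A minimal sketch of the approach above, assuming each of the variables A-H can be read off as a boolean (which is exactly the hard part, and is not solved here). The `Scenario` class, its field names, and both reporter functions are hypothetical names introduced only for illustration. It contrasts a reporter that tracks only the downstream variable H with one that requires all of A-H:

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Scenario:
    """Hypothetical truth values for the variables A-H listed above."""
    safe_before: bool          # A: diamond safe at time t-1
    safe_now: bool             # B: diamond safe at time t (what we care about)
    photons_bounce: bool       # C
    photons_in_transit: bool   # D
    photons_arrive: bool       # E
    camera_triggers: bool      # F
    human_interprets: bool     # G
    human_believes_safe: bool  # H


def downstream_reporter(s: Scenario) -> bool:
    """A reporter that found a fast way to predict only H."""
    return s.human_believes_safe


def conservative_reporter(s: Scenario) -> bool:
    """Answers 'safe' only if every one of A-H holds."""
    return all(getattr(s, f.name) for f in fields(s))


# On the training set all variables agree, so both reporters get low loss.
training_case = Scenario(*([True] * 8))
# Tampering: E through H are true, but A-D are false.
tampering_case = Scenario(False, False, False, False, True, True, True, True)
# Lights turned off: the diamond is safe (A, B) but C-H are all false.
lights_off_case = Scenario(True, True, False, False, False, False, False, False)

if __name__ == "__main__":
    for name, case in [("training", training_case),
                       ("tampering", tampering_case),
                       ("lights off", lights_off_case)]:
        print(name, downstream_reporter(case), conservative_reporter(case))
```

The lights-off case shows the over-conservatism mentioned above: the conservative reporter answers "not safe" even though B is true.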

• Question: Would a proposal be ruled out by a counterexample even if that counterexample is exponentially unlikely?

I’m imagining a theorem, proved using some large deviation estimate, of the form: If the model satisfies hypotheses XYZ, then it is exponentially unlikely to learn W. Exponential in the number of parameters, say. In which case, we could train models like this until the end of the universe and be confident that we will never see a single instance of learning W.
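
One hedged way to write down the kind of bound this question has in mind. The constant c > 0, the parameter count n, and the number of training runs T are illustrative symbols, not taken from any actual theorem:

```latex
\Pr[\text{a single training run learns } W] \le e^{-c n}
\quad\Longrightarrow\quad
\Pr[\text{at least one of } T \text{ runs learns } W] \le T \, e^{-c n}
```

The second inequality is just a union bound, so it needs no independence between runs. Under these assumptions the overall failure probability stays negligible as long as T is much smaller than e^{cn}.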

• I’d be fine with a proposal that flips coins and fails with small probability (in every possible world).

• Hello, I have an issue with the epistemology of the problem: even if the training process gave us the behavior we want, we would have no way to check that the AI is working properly in practice.
Let me give more details. In the vault problem, given the same information, consider an AI that just has to answer the question “Is the diamond still in the vault?”

One thing we can suppose is that the set Y from which we draw the labeled examples to train the AI (a set of techniques for the thief) is not important: trying to increase its size isn’t a solution, because there is always something beyond what we can imagine. We can instead try to solve the problem relative to Y. Let X be the set of scenarios that the AI can understand given that it was trained on Y. Then the only way to act on X\Y is to train on Y in a specific way (I think). So we need a link between X and Y that we can exploit. For that we need to know what X looks like, but we can’t, since finding that out is the goal. The only thing we could know is X’, the set of scenarios that could be imagined or understood by a human, even if that human could not label such a scenario. Since, by definition, we don’t know whether X = X’, there may always be some cases in which the AI understood how the thief did it but we don’t.
To me, the problem here is to get the AI to give us the information it has when the thief uses a technique in X’, rather than anywhere in X. In X\X’, there is nothing we know that could help us guide the AI toward good behavior on this set. But it seems possible in X’, because we can imagine scenarios and therefore ways of guiding the AI.
So what I don’t understand is why the counterexample with the thief using a secret property of transistors is a good counterexample. To me, we are in the case where, because the method is out of reach for humans, we have no idea whether the AI tells the truth or not, because we can’t be sure we can train an AI to have a specific behavior on examples we could not imagine. Moreover, we can’t check whether it tells the truth, so how would we trust it?

Thank you for reading

• If the predictor AI is in fact imitating what humans would do, why wouldn’t it throw its hands up at an actuator sequence that is too complicated for humans—isn’t that what humans would do? (I’m referring to the protect-the-diamond framing here.)

• As described in the report it would say “I’m not sure” when the human wasn’t sure (unless you penalized that).

That said, often a human who looks at a sequence of actions would say “almost certainly the diamond is there.” They might change their answer if you also told them “by the way these actions came from a powerful adversary trying to get you to think the diamond is there.” What exactly the reporter says will depend on some details of e.g. how the reporter reasons about provenance.

But the main point is that in no case do you get useful information about examples where a human (with AI assistants) couldn’t figure out what was happening on their own.

• Naive question: does this scenario include cases of a human physically breaking into the vault at random times, so that the sensor information, predictor reports, and outcome for the human in these situations would be known?

• Silly question warning.

You think that when an AI performs a bad action (say, removing the diamond), the AI has to have knowledge that the diamond is in fact no longer there, even when the camera (falsely) shows that the diamond is there and the human confirms that the diamond is there.

You call this ELK

You want the human to have access to this knowledge, as this is useful to choosing decisions that the human wants.

This is hard. So you have people propose how to do this.

And then people try to explain why that strategy wouldn’t work.

Rinse and repeat.

Once you have a proposal that nobody is able to show doesn’t work… profit?

Correct any misunderstandings in my basic overview above.

• This broadly seems right. Some details:

• The “explain why that strategy wouldn’t work” step specifically takes the form of “describing a way the world could be where that strategy demonstrably doesn’t work” (rather than more heuristic arguments).

• Once we have a proposal where we try really hard to come up with situations where it could demonstrably fail, and can’t think of any, we will probably need to do lots of empirical work to figure out if we can implement it and if it actually works in practice. But we hope that this exercise will teach us a lot about the nature of the empirical work we’ll need to do, as well as providing more confidence that the strategy will generalize beyond what we are able to test in practice. (For example, ELK was highlighted as a problem in the first place after ARC researchers thought a lot about possible failure modes of iterated amplification.)

• You can just make a plexiglass box, 2 by 2 meters, and put sensors on the inside that are connected to a battery inside the glass box. These sensors would cover every centimeter of the glass, so moving or breaking the glass would be impossible, because all the mechanics are on the inside. And if you want to take the diamond out, as a guard or someone similar, just let everyone know you are taking it and let the alarm ring. When the sensors go off, they will send a signal to the guards who are protecting it, who will come to catch the thief. More details of the plexiglass: first, it will be 25 cm thick and screwed down to the floor. You can also install a gas inside the plexiglass to put the thief to sleep, and a camera. 1. If the thief has a gas mask, you can install a device like the one on a Rolls-Royce (the car brand) to retract the diamond into the ground and lock it safely. 2. The thief can’t cut the wires or jam the signal, because the mechanism is on the inside and the signal will be wireless, with a wire through the concrete that goes to the guard. The wire is just 10 m long and it can’t be cut.

I hope I helped. Contact me by email if I won the prize!

• [deleted]
• If you want, you can slightly refactor my proposal to include a reporter module that takes the primary model’s hidden representations as input and outputs more interpretable representations for the student models to use. That would leave the primary model’s training objective unchanged. However, I don’t think this is a good idea for much the same reason that training just the classification head of a pretrained language model isn’t a good idea.

However, I think training the primary model to be interpretable to other systems may actually improve economic competitiveness. The worth of a given approach depends on the ratio of capabilities to compute required. If you have a primary model whose capabilities are more easily distilled into smaller models, that’s an advantage from a competitiveness standpoint. You can achieve better performance on cheaper models compared to competitors.

I think people are FAR too eager to assume a significant capabilities/​interpretability tradeoff. In a previous post, I used analogies to the brain to argue that there’s enormous room to improve the interpretability of existing ML systems with little capabilities penalty.

To go even further, more interpretable internal representations may actually improve learning. ML systems face their own internal interpretability problems. To optimize a system, gradient descent needs to be able to disentangle which changes will benefit vs harm the system’s performance. This is a form of interpretability, though not one we often consider.

Being “interpretable to gradient descent” is very different from being “interpretable to humans”. However, most of my proposal focuses on making the primary model generally interpretable to many different systems, with humans as a special case. I think being more interpretable may directly lead to being easier to optimize. Intuitively, it seems easier to improve a system with simple, contained and modular internal components that interact with each other in a straightforward, consistent manner.

Imagine the primary model as being made up of a collection of interacting circuits. If those circuits have many complex, nonlinear interdependencies, then gradient descent will have issues with quickly modifying the circuits. Significant changes to one circuit will cause issues with the circuits that depend on the modified circuit for their input. Alternatively, if the circuits are more modular with robust, consistent interfaces, then it’s easier for gradient descent to make significant changes quickly, without messing up the rest of the network. This is closely related to the fact that documentation and good APIs become vastly more important as software projects increase in size.

(Note that you can view vanishing/​exploding gradients as an interpretability issue preventing gradient descent from figuring out how it should change certain weights, and vanishing/​exploding gradients certainly hurt performance.)
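
As a toy illustration of the vanishing/exploding gradient point only (not of the proposal itself), the sketch below computes the gradient of the output with respect to the input for a chain of scalar units. The function names, depths, and weights are assumptions chosen for illustration. Each unit multiplies the gradient by its local derivative, so repeated factors below 1 shrink it toward zero and repeated factors above 1 blow it up:

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_chain_gradient(depth: int, weight: float = 1.0, x: float = 0.5) -> float:
    """d(output)/d(input) for `depth` stacked scalar units h -> sigmoid(weight * h)."""
    h, grad = x, 1.0
    for _ in range(depth):
        a = sigmoid(weight * h)
        # Chain rule: multiply by this unit's local derivative, weight * a * (1 - a).
        grad *= weight * a * (1.0 - a)
        h = a
    return grad


def linear_chain_gradient(depth: int, weight: float) -> float:
    """d(output)/d(input) for `depth` stacked scalar units h -> weight * h."""
    grad = 1.0
    for _ in range(depth):
        grad *= weight
    return grad


if __name__ == "__main__":
    for depth in (1, 5, 20):
        print(depth,
              sigmoid_chain_gradient(depth),       # shrinks with depth (vanishing)
              linear_chain_gradient(depth, 2.0))   # grows with depth (exploding)
```

With a weight of 1, each sigmoid unit contributes a factor of at most 0.25, so the gradient through 20 units is at most 0.25 to the 20th power; in that regime the signal telling gradient descent how to change early weights is nearly gone.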