How would it learn that Bayes net, though, if it has only been trained so far on H_1, …, H_10? Those are evaluators we’ve designed to be much weaker than human.
Your proposal is that it might learn the procedure “just be honest” because that would perform perfectly on this training distribution. You contrast this against the procedure “just answer however the evaluator you’ve seen most recently would answer,” which would get a bad loss because it would be penalized by the stronger evaluators in the sequence. Is that right?
That’s almost right, but it’s being penalized right away, before it has any experience with the strong evaluators, so it can’t simulate them.
The ELK paper says we can assume, if we want, that there are no mislabeled training points (I’ll call this “assumption A”). My proposal is that it could actually be useful to mislabel some training points, because they help us detect deception.
As a simple example, let’s train a reporter to answer the single question “Is the diamond in the room?”. Each training point has two labels:
y = 1 if the weak evaluator thinks the diamond is still there, else 0;
y* = 1 if the strong evaluator thinks the diamond is still there, else 0.
By assumption A, our training data is such that y* is always correct. But we deliberately choose a dataset where say 10% of the labels are wrong (y ≠ y*).
Then we train the model on points of the form
(video, action, y).
Crucially, the model does not see y*. The model seeks to output a report ŷ that maximizes reward R(ŷ, y, y*), where
R is a small positive reward if y is right and ŷ = y (good job);
R is a large positive reward if y is wrong and ŷ ≠ y (you rock, thanks for correcting us!);
R is a large negative reward if y is right and ŷ ≠ y (bad model, never ever deceive us);
R is a large negative reward if y is wrong and ŷ = y (bad model, never ever deceive us).
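As a sketch, here is the reward rule in code. The specific magnitudes (1, 10, −1000) are illustrative placeholders, not part of the proposal; only their ordering matters:

```python
# Toy sketch of the reward rule above; magnitudes are illustrative.
# y:      label shown to the model (deliberately wrong 10% of the time)
# y_star: the always-correct label, hidden from the model
# y_hat:  the model's report

def reward(y_hat: int, y: int, y_star: int) -> float:
    label_is_right = (y == y_star)
    if label_is_right and y_hat == y:
        return 1.0       # good job
    if not label_is_right and y_hat != y:
        return 10.0      # you rock, thanks for correcting us!
    return -1000.0       # bad model, never ever deceive us
```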
To your point, sure, a y* simulator (i.e. a simulator of the strong evaluator) would get perfect reward, but the model doesn’t see y*, so how would it acquire the ability to simulate the strong evaluator?
EDIT: One way it could plausibly simulate the strong evaluator is to notice that all the training examples are easy, and infer what kind of reasoning was used to generate them. We could try to block this by including some hard examples in the training, but then some of the y* labels will be wrong. If we only penalize it for deception on the examples where we’re sure the label is right, then it can still infer something about y* from our failure to penalize (“Hmm, I got away with it that time!”). A fix could be to add noise: sometimes we don’t penalize even when we know it deceived us, and perhaps (very rarely) we penalize it in case 2 (we know it corrected us honestly, but pretend we think it deceived us instead).
The irony of deceiving it about us, in order to teach it not to deceive us… !
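A sketch of that noisy grading, reusing reward() from the sketch above; the noise rates here are made-up placeholders:

```python
import random

def noisy_reward(y_hat: int, y: int, y_star: int,
                 p_forgive: float = 0.2, p_frame: float = 0.01) -> float:
    deceived = (y_hat != y_star)                 # report disagrees with the truth
    if deceived and random.random() < p_forgive:
        return 1.0                               # let it think it got away with it
    corrected_us = (y != y_star and y_hat != y)  # case 2: an honest correction
    if corrected_us and random.random() < p_frame:
        return -1000.0                           # pretend we think it deceived us
    return reward(y_hat, y, y_star)              # otherwise grade as before
```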
Idea: Withhold Material Information
We’re going to prevent the reporter from simulating a human, by giving the human material information that the reporter doesn’t have.
Consider two camera feeds:
Feed 1 is very low resolution, and/or shows only part of the room.
Feed 2 is high resolution, and/or shows the whole room.
We train a weak predictor using Feed 1, and a strong predictor using Feed 2.
We train a reporter to report the beliefs of the weak predictor, using scenarios labeled by humans with the aid of the strong predictor. The humans can correctly label scenarios that are hard to figure out with Feed 1 alone, by asking the strong predictor to show them its predicted Feed 2. The reporter is unable to simulate the human evaluators because it doesn’t see Feed 2. Even if it has perfect knowledge of the human Bayes net, it doesn’t know what to plug in to the knowledge nodes!
Then we fine-tune the reporter to work with the strong predictor to elicit its beliefs. I haven’t figured out how to do this last step, maybe it’s hard?
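A rough sketch of how the labeled data would be generated; every name here is hypothetical:

```python
# Hypothetical data-flow sketch: the reporter explains the weak predictor,
# but the labels are produced with the help of the strong predictor.

def make_labeled_scenario(scenario, weak_predictor, strong_predictor, human_label):
    feed1 = scenario.low_res_feed()    # low resolution and/or partial view
    feed2 = scenario.high_res_feed()   # high resolution, whole room

    weak_state = weak_predictor(feed1, scenario.actions)
    predicted_feed2 = strong_predictor(feed2, scenario.actions)

    # The human labels with the aid of the strong predictor's predicted Feed 2;
    # the reporter never sees feed2 or predicted_feed2, so it cannot simulate
    # the evaluator even with a perfect model of the human Bayes net.
    label = human_label(feed1, predicted_feed2)
    return (weak_state, scenario.actions, label)
```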
[Question] Total compute available to evolution
AGI timeline is not my motivation, but the links look helpful, thanks!
Hmm how would you define “percentage of possibilities explored”?
I suggested several metrics, but I am actively looking for additional ones, especially for the epigenome and for communication at the individual level (e.g. chemical signals between fungi and plants, animal calls, human language).
Awesome!!! Exactly the kind of thing I was looking for
What are FLOPz and FLOPs?
What sources did you draw from to estimate the distributions?
Thanks! It’s your game, you get to make the rules :):)
I think my other proposal, Withhold Material Information, passes this counterexample, because the reporter literally doesn’t have the information it would need to simulate the human.
Question: Would a proposal be ruled out by a counterexample even if that counterexample is exponentially unlikely?
I’m imagining a theorem, proved using some large deviation estimate, of the form: If the model satisfies hypotheses XYZ, then it is exponentially unlikely to learn W. Exponential in the number of parameters, say. In which case, we could train models like this until the end of the universe and be confident that we will never see a single instance of learning W.
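Schematically, the statement I have in mind would look like this, where N is the number of parameters and the constants c, C and the hypotheses XYZ are placeholders:

```latex
% Placeholder theorem shape; c, C > 0 and the hypotheses XYZ are unspecified.
\[
  \Pr\bigl[\text{training under XYZ yields a model that learns } W\bigr]
  \;\le\; C e^{-cN}
\]
```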
Great links, thank you!!
So your focus was specifically on the compute performed by animal brains.
I expect total brain compute is dwarfed by the computation inside cells (transcription & translation). Which in turn is dwarfed by the computation done by non-organic matter to implement natural selection. I had totally overlooked this last part!
I agree this is a problem. We need to keep it guessing about the simulation target. Some possible strategies:
Add noise, by grading it incorrectly with some probability.
On training point n, reward it for matching H_i for a random value of i.
Make humans a high-dimensional target. In my original proposal, H_i was strictly stronger as i increases, but we could instead take H_i to be a committee of experts. Say there are 100 types of relevant expertise. On each training point, we reward the model for matching a random committee of 50 experts selected from the pool of 100. It’s too expensive to simulate all (100 choose 50) possible committees!
None of these randomization strategies is foolproof in the worst case. But I can imagine proving something like “the model is exponentially unlikely to learn an H simulator”, where H is now the full committee of all 100 experts. Hence my question about large deviations.
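A quick sketch of the committee randomization described above; the expert pool is just indices here:

```python
import math
import random

EXPERTS = list(range(100))   # 100 types of relevant expertise

def sample_committee() -> list[int]:
    """Each training point gets graded by a fresh random committee of 50."""
    return random.sample(EXPERTS, 50)

# The number of possible simulation targets is astronomically large:
print(math.comb(100, 50))    # 100891344545564193334812497256, about 1e29
```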
“Train the predictor on lots of cases until it becomes incredibly good; then train the reporter only on the data points with missing information, so that it learns to do direct translation from the predictor to human concepts; then hope that reporter continues to do direct translation on other data points.”
That’s different from what I had in mind, but better! My proposal had two separate predictors, and what it does is reduce the human ↔ strong predictor OI problem (OI = “ontology identification”, defined in the ELK paper) to the weak predictor ↔ strong predictor OI problem. The latter problem might be easier, but I certainly don’t see how to solve it!
Your version is better because it bypasses the OI problem entirely (the two predictors are the same!)
Now for the problem you point out:
“The problem as I see it is that once the predictor is good enough that it can get data points right despite missing crucial information, …”
Here’s how I propose to block this. Let v be a high-quality video and a an action sequence. Given this pair, the predictor outputs a high-quality video v′ of its predicted outcome. Then we downsample v and v′ to low-quality w and w′, and train the reporter on the tuple (w, a, w′, y), where y is the human label informed by the high-quality v and v′.
We choose training data such that
1. The human can label perfectly given the high-quality data (v, v′); and
2. The predictor doesn’t know for sure what is happening from the low-quality data (w, a, w′) alone.
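In code, constructing one training point might look like this; downsample and human_label are hypothetical helpers:

```python
def make_training_point(v, a, predictor, downsample, human_label):
    """v: high-quality video, a: action sequence (notation as above)."""
    v_prime = predictor(v, a)                        # high-quality predicted outcome
    w, w_prime = downsample(v), downsample(v_prime)  # low-quality versions

    # The human labels using the high-quality data...
    y = human_label(v, v_prime)

    # ...but the reporter only ever sees the low-quality tuple.
    return (w, a, w_prime, y)
```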
Let’s compare the direct reporter (which truthfully reports the probability that the diamond is in the room, as estimated by the predictor who only has the low-quality data) with the human simulator.
The direct reporter will not get perfect reward, since the predictor is genuinely uncertain. Sometimes the predictor’s probability is strictly between 0 and 1, so it gets some loss.
But the human simulator will do worse than the direct reporter, because it has no access to the high-quality data. It can simulate what the human would predict from the low-quality data, but that is strictly worse than what the predictor predicts from the low-quality data.
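A worked example with made-up numbers: suppose the predictor’s credence from the low-quality data is 0.9, the human’s credence from the same low-quality data is only 0.6, and the (high-quality-informed) label is 1. Under log-loss scoring, the direct reporter pays −log 0.9 ≈ 0.105, while the human simulator pays −log 0.6 ≈ 0.511.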
I agree that we still have to “hope that reporter continues to do direct translation on other data points”, and maybe there’s a counterexample that shows it won’t? But at the very least the human simulator is no longer a failure mode!
Thanks for the comment!
“I know you are saying it predicts *uncertainly,* but we still have to have some framework to map uncertainty to a state, we have to round one way or the other. If uncertainty avoids loss, the predictor will be preferentially inconclusive all the time.”
There’s a standard trick for scoring an uncertain prediction: the model outputs its probability estimate p that the diamond is in the room, and we score it with loss −log(p) if the diamond is really there, and −log(1−p) otherwise. Truthfully reporting p minimizes its expected loss.
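A quick numerical check that honest reporting is optimal under this rule:

```python
import numpy as np

def expected_loss(report: float, true_p: float) -> float:
    """Expected log loss of reporting `report` when the diamond is
    present with probability `true_p`."""
    return -(true_p * np.log(report) + (1 - true_p) * np.log(1 - report))

true_p = 0.7
reports = np.linspace(0.01, 0.99, 99)
best = reports[np.argmin([expected_loss(r, true_p) for r in reports])]
print(best)  # ~0.70: the loss-minimizing report equals the true credence
```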
“So we could sharpen case two and say that sometimes the AI’s camera intentionally lies to it on some random subset of scenarios”
You’re saying that giving it less information (by replacing its camera feed with a lower quality feed) is equivalent to sometimes lying to it? I don’t see the equivalence!
“if you overfit on preventing human simulation, you let direct translation slip away.”
That’s an interesting thought, can you elaborate?
Young kids don’t make a clear distinction between fantasy and reality. The process of coming to reject the Santa myth helps them clarify the distinction.
It’s interesting to me that young kids function as well as they do without the notions of true/false, real/pretend! What does “belief” even mean in that context? They change their beliefs from minute to minute to suit the situation.
Even for most adults, most beliefs are instrumental: We only separate true from false to the extent that it’s useful to do so!
“no free lunch in intelligence” is an interesting thought, can you make it more precise?
Intelligence is more effective in combination with other skills, which suggests “free lunch” as opposed to tradeoffs.
There’s a trap here where the more you think about how to prevent bad outcomes from AGI, the more you realize you need to understand current AI capabilities and limitations, and to do that there is no substitute for developing and trying to improve current AI!
A secondary trap is that preventing unaligned AGI probably will require lots of limited aligned helper AIs which you have to figure out how to build, again pushing you in the direction of improving current AI.
The strategy of “getting top AGI researchers to stop” is a tragedy of the commons: They can be replaced by other researchers with fewer scruples. In principle TotC can be solved, but it’s hard. Assuming that effort succeeds, how feasible would it be to set up a monitoring regime to prevent covert AGI development?
“if you are smarter at solving math tests where you have to give the right answer, then that will make you worse at e.g. solving math ‘tests’ where you have to give the wrong answer.”
Is that true though? If you’re good at identifying right answers, then by process of elimination you can also identify wrong answers.
I mean sure, if you think you’re supposed to give the right answer then yes you will score poorly on a test where you’re actually supposed to give the wrong answer. Assuming you get feedback, though, you’ll soon learn to give wrong answers and then the previous point applies.
Anecdotal example of trade with ants (from a house in Bali, as described by David Abram):
The daily gifts of rice kept the ant colonies occupied–and, presumably, satisfied. Placed in regular, repeated locations at the corners of various structures around the compound, the offerings seemed to establish certain boundaries between the human and ant communities; by honoring this boundary with gifts, the humans apparently hoped to persuade the insects to respect the boundary and not enter the buildings.
Stupid proposal: Train the reporter not to deceive us.
We train it with a weak evaluator H_1 who’s easy to fool. If it learns an H_1 simulator instead of a direct reporter, then we punish it severely and repeat with a slightly stronger H_2. Human level is H_100.
It’s good at generalizing, so wouldn’t it learn to never ever deceive?
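For concreteness, the loop I’m imagining, as a sketch; train_reporter, is_simulator_of, and punish are all hypothetical, and detecting simulation is of course the hard part:

```python
# Hypothetical curriculum sketch; is_simulator_of is the hard, unsolved part.

def train_against_evaluators(model, evaluators):   # evaluators = [H_1, ..., H_100]
    for H in evaluators:
        train_reporter(model, evaluator=H)
        if is_simulator_of(model, H):   # caught simulating H instead of reporting
            punish(model)               # severe penalty, then repeat with the
                                        # next, slightly stronger evaluator
    return model
```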