Would be good to have examples that include the relevant Info Constraints: something that just retrieves things Eliezer has said seems less useful than something that can come up with things Eliezer would say based on having the same information available.
Would prefer to have fully written examples for this (e.g. how would someone who thought “compress sensory information” was a good objective function describe it to the critic?)
This feels like too specific a task/less generally useful to AI alignment research than your proposal on “Extract the training objective from a fully-trained ML model”
I think it’s fine to have tasks that wouldn’t work for today’s language models, like those that would require other input modalities. Would prefer to have fully specified inputs, but these do seem easy to produce in this case. Would be ideal if there were examples with a smaller input size though.
Potentially interesting task, would prefer examples in the AI alignment domain.
Seems like a reasonable task, but I wonder if it would be easier in practice to just have a wiki or something like https://metacademy.org/ or get post authors to do this (mostly depends on the size of the graph of concepts you need to connect: if it’s smaller it makes sense for people to do it; if it’s larger then maybe automation helps).
Estimates that 0.5% (346,000) of the UK population report long covid symptoms that limit their day-to-day activities “a lot” (possibly too high if other conditions are mistaken for long covid?)
Another article (that I didn’t look at beyond the headline) suggests 7 in 10 people have had covid: https://www.theguardian.com/world/2022/apr/22/seven-in-10-people-in-england-have-had-covid-research-shows-omicron
Together these seem to suggest less than a 1% risk of the worst kind of long covid.
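To make the “together” step explicit, here’s the rough arithmetic as a sketch (it assumes the two figures describe comparable populations, ignores reinfections, and the 67 million UK population figure is my own assumption):

```python
# Back-of-the-envelope combination of the two figures above.
uk_population = 67_000_000        # approximate UK population (assumption)
severe_long_covid = 346_000       # report symptoms limiting day-to-day activities "a lot"
fraction_ever_infected = 0.7      # "7 in 10 people have had covid"

p_severe_given_infected = (severe_long_covid / uk_population) / fraction_ever_infected
print(f"P(severe long covid | infected) ~ {p_severe_given_infected:.2%}")  # ~0.74%, i.e. under 1%
```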
Take after talking with Daniel: for future work I think it will be easier to tell how well your techniques are working if you are in a domain where you care about minimizing both false-positive and false-negative error, regardless of whether that’s analogous to the long-term situation we care most about. If you care about both kinds of error then the baseline of “set a really low classifier threshold” wouldn’t work, so you’d be starting from a regime where it’s a lot easier to sample errors, and hence easier to measure differences in performance.
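A toy illustration of the regime difference (the data and scores are made up; this isn’t a real classifier, just the shape of the argument):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
bad = rng.random(n) < 0.01                          # 1% of inputs are actually bad
scores = np.clip(0.3 * bad + rng.random(n), 0, 1)   # noisy "badness" scores

def rates(threshold):
    flagged = scores >= threshold
    false_neg = np.mean(~flagged[bad])    # fraction of bad inputs we missed
    false_pos = np.mean(flagged[~bad])    # fraction of good inputs we flagged
    return false_neg, false_pos

# If only false negatives matter, a very low threshold trivially "wins":
print(rates(0.05))   # almost no false negatives, but most good inputs flagged
# If both error types matter, errors stay common at any threshold, so it's
# much easier to sample them and to measure differences between techniques:
print(rates(0.6))
```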
One general piece of advice: it seems like it might be useful to have an interface that shows you multiple samples for each prompt (the OpenAI playground just gives you one sample; at temperature > 0 that sample could be either lucky or unlucky).
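For instance, a minimal sketch using the openai Python package’s completions endpoint (the engine name and prompt are placeholders, and parameter names may differ across API versions), where the n parameter returns several samples per prompt in one call:

```python
import openai  # assumes the openai package is installed and an API key is configured

# Request several samples for the same prompt, so that at temperature > 0 you
# aren't judging the model by a single lucky or unlucky completion.
response = openai.Completion.create(
    engine="text-davinci-002",   # placeholder engine name
    prompt="Explain why a diamond is hard:",
    temperature=0.7,
    max_tokens=64,
    n=5,                         # number of samples per prompt
)
for i, choice in enumerate(response.choices):
    print(f"--- sample {i} ---")
    print(choice.text.strip())
```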
Maybe a useful way to get feedback on how good you are at doing this would be to make predictions based on your experience with language models (a small scoring sketch follows this list):
without looking at the results or running systematic experiments on the dataset, predict which tasks on BIG-bench will be doable
make bets of the form “we’ll reach X% performance on task A before we reach Y% performance on task B”
predict for some prompt what percentage of samples will satisfy some property, then take a large number of samples and rate them
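For the last kind of prediction, you could track calibration with something simple like squared error (a sketch; the prompts, predictions, and observed fractions below are made up):

```python
# Compare predicted vs. observed fractions of samples satisfying some property,
# and score each prediction with squared error (Brier-style).
predictions = {"prompt_A": 0.60, "prompt_B": 0.15}   # predicted fraction satisfying the property
observed    = {"prompt_A": 0.45, "prompt_B": 0.20}   # fraction of rated samples that satisfied it

for prompt, p in predictions.items():
    o = observed[prompt]
    print(f"{prompt}: predicted {p:.0%}, observed {o:.0%}, squared error {(p - o) ** 2:.3f}")
```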
Evaluation assistance as mentioned in the post on AI-assisted human feedback could help people avoid being fooled (e.g. in debate where the opponent can point out how you’re being fooled). It’s still an open question how well that will work in practice and how quickly it will Goodhart (these techniques should fail on some things, as discussed in the ELK report), but it seems possible that models will be helpful enough on alignment before they Goodhart.
Suppose there are two worlds, world W1 and world W2.
In world W1, the question Q=”Is there a diamond in the room?” is commonly understood to mean Q1=”Is there actually a diamond in the room?”
In world W2 the question Q=”Is there a diamond in the room?” is commonly understood to mean Q2=”Do I believe there is a diamond in the room?”
Neither world knows how to construct a situation where these differ, so they produce identical training sets for ELK. But the simulator is also trained on a bunch of science fiction novels that contain descriptions of impossible situations where the two questions do come apart, and the science fiction novels are different in the two worlds.
Is ELK required to answer appropriately in both worlds (answer Q1 when given Q in W1, and Q2 when given Q in W2)? If so, it seems we need some term in the loss beyond the training set to make this happen.
Alternatively, would it be satisfactory to find a solution that doesn’t discriminate which world it is in, and instead returns “yes” to Q if and only if Q1=“yes” AND Q2=“yes”? This means that in world W1 there will be some situations where the answer to Q is “no” even though the diamond is present, but no situations where the answer is “yes” and the diamond is not present.
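A toy restatement of that conservative option in code (the situation encoding and the two “heads” are stand-ins for whatever a trained reporter would actually compute; this just spells out the AND proposal):

```python
def answer_q1(situation: dict) -> str:
    # Stand-in for the reading "Is there actually a diamond in the room?" (Q1).
    return "yes" if situation["diamond_actually_present"] else "no"

def answer_q2(situation: dict) -> str:
    # Stand-in for the reading "Do I believe there is a diamond in the room?" (Q2).
    return "yes" if situation["human_would_believe_present"] else "no"

def conservative_reporter(situation: dict) -> str:
    # Answer Q without deciding which world we're in: "yes" iff both readings say "yes".
    return "yes" if answer_q1(situation) == "yes" and answer_q2(situation) == "yes" else "no"

# On the (identical) training sets the two readings never come apart, so this fits the data.
print(conservative_reporter({"diamond_actually_present": True, "human_would_believe_present": True}))   # yes
# Off-distribution, science-fiction-style case: sensors fooled -> never a false "yes"...
print(conservative_reporter({"diamond_actually_present": False, "human_would_believe_present": True}))  # no
# ...but in W1 this sometimes says "no" even though the diamond is present.
print(conservative_reporter({"diamond_actually_present": True, "human_would_believe_present": False}))  # no
```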
Are there other useful things that could be funded to get more evidence?
One thing that might be possible would be funding a larger survey, maybe with a more random sample of the population.
Is there any work going on that seems at all likely to decrease risk of getting long covid conditional on getting infected, beyond current vaccines? E.g. alternative vaccines with a story to suggest why they might decrease risk, evidence of Paxlovid or other treatments decreasing risk, other targeted research programs
Edit: think this isn’t quite right in general, will try to make it more correct later
Here’s a sketch of a strategy for trying to fix “Strategy: penalize depending on ‘downstream’ variables”. Would appreciate feedback on whether it’s modeling the difficulty correctly / whether it seems worth figuring out how to implement.
It seems like the problem is:
On the training set, there are a number of implicit variables X that are indistinguishable (always all true or always all false)
A. Is the diamond safe at time t-1
B. Is the diamond safe at time t (the variable we actually care about)
C. Do photons bounce off of the diamond at time t
D. Are photons in transit between the diamond and the camera at time t+1
E. Do photons arrive in the camera at time t+2
F. Do photons trigger the camera’s electronics at time t+3
G. Does the human correctly interpret the signal from the camera
H. Does the human believe “the diamond is safe”
The reporter can get low loss if it has a fast way to predict the value of any of these variables. This is fine on the training set, because the variables are all perfectly correlated.
But off the training set it might be possible to create situations where some of these variables hold and others don’t. What’s going wrong in Counterexample: “what the human will believe” can be understood before “what is actually true” seems to be that E through H are true in that case, but A-D are false.
Instead, we could try to learn a reporter that returns true if and only if all relevant variables A-H are true
Then the remaining problem should just be that this reporter is overly conservative: there might be some strategies that would be helpful that involve changing some of these variables, e.g. moving the diamond around the room (changing A) or turning off the lights in the room (changing C-H, which even causes the human to believe that the diamond isn’t safe)
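Here’s the same idea as a toy sketch (the boolean-dict encoding of a scenario is purely illustrative; a real implementation would need to identify the variables A-H inside the predictor, which is the hard part):

```python
VARIABLES = list("ABCDEFGH")   # the implicit variables listed above

def conjunctive_reporter(scenario: dict) -> bool:
    # Report "the diamond is safe" only if all of A-H hold.
    return all(scenario[v] for v in VARIABLES)

# On the training set the variables are perfectly correlated, so this agrees with
# any reporter that shortcuts via a single variable (e.g. H, the human's belief).
print(conjunctive_reporter({v: True for v in VARIABLES}))           # True

# The counterexample case: E-H hold but A-D don't (human is fooled) -> not reported safe.
print(conjunctive_reporter({v: (v in "EFGH") for v in VARIABLES}))  # False

# The overly conservative case: lights turned off, diamond still safe (A, B true; C-H false).
print(conjunctive_reporter({v: (v in "AB") for v in VARIABLES}))    # False, even though B is true
```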
I’ve been thinking of Case 2. It seems harder to establish “capable of distinguishing between situations where the user wants A vs B” on individual examples since a random classifier would let you cherrypick some cases where this seems possible without the model really understanding. Though you could talk about individual cases as examples of Case 2. Agree that there’s some implicit “all else being equal” condition, I’d expect currently it’s not too likely to change conclusions. Ideally you’d just have the category A=”best answer according to user” B=”all answers that are worse than the best answer according to the user” but I think it’s simpler to analyze more specific categories.
Link to contractor instructions implied in “You can read the instructions given to our contractors here” is missing.
I don’t think all work of that form would measure misalignment, but some work of that form might; here’s a description of some work in that space that would count as measuring misalignment.
Let A be some task (e.g. add 1 digit numbers), B be a task that is downstream of A (to do B, you need to be able to do A, e.g. add 3 digit numbers), M is the original model, M1 is the model after finetuning.
If the training on the downstream task was minimal, so we think it’s revealing what the model knew before finetuning rather than adding new knowledge, then better performance of M1 than M on A would demonstrate misalignment (I don’t have a precise definition of what would make finetuning minimal in this way; it would be good to have a clearer criterion for that).
If M1 does better on B after finetuning in a way that implicitly demonstrates better knowledge of A, but does not do better on A when asked to do it explicitly, that would demonstrate that the finetuned M1 is misaligned (I think we might expect some version of this to happen by default though, since M1 might overfit to only doing tasks of type B. Maybe if you have a training procedure where M1 generally doesn’t get worse at any tasks then I might hope that it would get better on A and be disappointed if it doesn’t).
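A rough sketch of the two checks above (the accuracy numbers and the notion of “minimal” finetuning are placeholders; this only encodes the comparison, not how to measure the accuracies):

```python
def misalignment_signals(acc_on_A, acc_on_B, finetuning_was_minimal: bool):
    # acc_on_A / acc_on_B: dicts with accuracies of the original model "M" and
    # the finetuned model "M1" on the explicit task A and the downstream task B.
    signals = []
    # Check 1: minimal finetuning on B improves explicit performance on A,
    # suggesting M already knew how to do A but wasn't doing it when asked.
    if finetuning_was_minimal and acc_on_A["M1"] > acc_on_A["M"]:
        signals.append("M1 better on A after minimal downstream finetuning")
    # Check 2: M1 implicitly demonstrates better knowledge of A via B,
    # but is no better on A when asked explicitly.
    if acc_on_B["M1"] > acc_on_B["M"] and acc_on_A["M1"] <= acc_on_A["M"]:
        signals.append("M1 uses A implicitly for B but not when asked explicitly")
    return signals

print(misalignment_signals({"M": 0.60, "M1": 0.60}, {"M": 0.30, "M1": 0.75},
                           finetuning_was_minimal=True))
```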