Even better than “getting models to explain why they’re doing what they’re doing in simpler terms that connect to things the human overseers understand” would be getting models to actually do the task in ways that are simpler and connect to things that human overseers understand. E.g. if a model can solve a task in multiple steps by looking up relevant information via internet searches that are recorded and readable by the overseer, instead of using knowledge opaquely stored in the weights, that seems like a step in the right direction.
One easy way to produce people who can’t solve the task, for sandwiching experiments, is to take people who could solve it and give them insufficient time, or leave them uninformed of some relevant facts about the specific task they are trying to solve.
A simpler way to measure progress towards sandwiching, if you can’t run the full setup directly, is to check whether people provide better supervision with your tool than without it, that is, whether they accomplish more on the task.
Both of these approaches feel like they aren’t quite solving the whole problem, because ultimately we want systems that help humans supervise tasks where they haven’t developed the right concepts, or couldn’t understand them even with years of study.
Here is a regularly updated version of the vaccine chart https://covid19tracker.ca/vaccinegap.html
If the High-Rated Sentence Producer were restricted to outputting only single steps of a mathematical proof, and the single steps were evaluated independently, with the human unable to look at previous steps, then I wouldn’t expect this kind of reward hacking to occur. In math proofs, we can build proofs for more complex questions out of individual steps that don’t need to increase in complexity.
As I see it, debate on arbitrary questions could work if we figured out how to do something similar, having arguments split into single steps and evaluated independently (as in the recent OpenAI debate work), such that the debate AI can tackle more complicated questions with steps that are restricted to the complexity that humans can currently work with. Hard to know if this is possible, but still seems worth trying to work on.
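A minimal sketch of the independent-step idea, with a hypothetical `judge_step` standing in for a human (or trusted checker) who only ever sees one bounded-complexity inference at a time, never the whole argument:

```python
from typing import Callable, List, Set, Tuple

# A step is (premises, conclusion). Premises must be axioms or conclusions
# of earlier steps, but the *validity judgment* looks only at a single step.
Step = Tuple[List[str], str]

def verify_proof(steps: List[Step],
                 axioms: Set[str],
                 judge_step: Callable[[List[str], str], bool]) -> bool:
    established = set(axioms)
    for premises, conclusion in steps:
        # Structural check: every premise must already be established.
        if not all(p in established for p in premises):
            return False
        # Local judgment: the judge sees only this one step in isolation.
        if not judge_step(premises, conclusion):
            return False
        established.add(conclusion)
    return True
```

The point of the sketch is that the judge’s workload per step stays constant even as the proof as a whole tackles harder questions.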
For the preference learning skepticism, does this extend to the research direction (that isn’t yet a research area) of modelling long-term preferences/preferences-on-reflection? This is more along the lines of the “AI-assisted deliberation” direction from ARCHES.

To me it seems like AI alignment that can capture preferences-on-reflection could be used to find solutions to many other problems. Though there are good reasons to expect that we’d still want to do other work (because we might need theoretical understanding and okay solutions before AI reaches the point where it can help with research, because we want to do work ourselves to be able to check solutions that AIs reach, etc.)

It also seems like areas like FairML and Computational Social Choice will require preference learning as a component—my guess is that people’s exact preferences about fairness won’t have a simple mathematical formulation, and will instead need to be learned. I could buy the position that the necessary progress in preference learning will happen by default because of other incentives.
One thing I’d like to see is some more fleshed-out examples of the kinds of governance demands that you think might be important in the future and would be bottlenecked on research progress in these areas.
It seems that in principle a version of debate where only one agent makes statements and the other chooses which statements to expand could work, but it seems to require the judge to be very strict that each statement is 100% true. It seems hard to apply this kind of system to statements outside of formal mathematics.

Systems where both agents can make statements seem like they might be less vulnerable to judges accepting statements that aren’t 100% true. For one example, if both agents take turns being the arguer, then if both agents submit a path that is judged to be correct, you can stipulate that the agent with the shortest path wins (like imposing a simplicity prior).
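The shortest-path tie-break could be sketched like this (toy interfaces, purely illustrative; `judge` stands in for whatever correctness check the human applies):

```python
from typing import Callable, List, Optional

def winner(path_a: List, path_b: List,
           judge: Callable[[List], bool]) -> Optional[str]:
    # Each agent submits a full argument path; the judge checks correctness.
    a_ok, b_ok = judge(path_a), judge(path_b)
    if a_ok and b_ok:
        # Both correct: the shorter path wins, like imposing a simplicity prior.
        return "A" if len(path_a) <= len(path_b) else "B"
    if a_ok:
        return "A"
    if b_ok:
        return "B"
    return None  # neither argument survives judging
```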
HCH could implement the decomposition oracle by searching over the space of all possible decompositions (it would just be quite expensive).
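A toy illustration of that brute-force search, with the base agent modeled as a lookup table of answerable questions (all names and the candidate set are hypothetical):

```python
from typing import List, Optional

# Toy model of the base human: the set of questions they can answer directly.
BASE_ANSWERABLE = {"What is 2+2?", "What is 4+3?"}

def decomposition_oracle(candidates: List[List[str]]) -> Optional[List[str]]:
    # Enumerate candidate decompositions and return the first one whose
    # subquestions the base agent can all answer. The exhaustive enumeration
    # is exactly what makes this implementation "quite expensive".
    for decomposition in candidates:
        if all(q in BASE_ANSWERABLE for q in decomposition):
            return decomposition
    return None
```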
https://www.kialo.com/ lets people build debates on controversial topics in a hierarchical structure (more like stock debate, with both sides providing arguments), but doesn’t seem to have been used for explanations/arguments. I’d also be pretty interested to see more attempts at hierarchical explanations.
I think there are situations where you can still have subproblems whose outputs are long. A contrived example: suppose you have a problem where you want to calculate XOR(f(a), f(b)), where f(a) and f(b) are long strings. It seems reasonable to decompose into x=f(a), y=f(b), z=XOR(x, y), despite x and y being long, because there’s a simple way to combine them.

If we had an AI system that could work on “making progress on a problem for an hour”, then write down a complete description of everything it had figured out and pass that to another AI system, I’d count that as dividing the problem into subproblems, just in a way that’s probably inefficient.

I’d evaluate decompositions into subproblems by something like the total cost of solving the problem via that division. Some decompositions would be efficient and others would be inefficient; sometimes this would be because the output is large, but in other cases it could be because it takes a long time to write the input, or because a lot of work is repeated between subproblems.
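The contrived XOR example can be written out concretely. Here `f` is just a stand-in for an expensive subproblem with a long output (iterated hashing chosen arbitrarily); the recombination step is simple even though the intermediates are large:

```python
import hashlib

def f(seed: str, n: int = 64) -> bytes:
    # Stand-in for an expensive subproblem that produces a long output.
    out = b""
    while len(out) < n:
        out += hashlib.sha256(seed.encode() + out).digest()
    return out[:n]

def xor(x: bytes, y: bytes) -> bytes:
    # The cheap recombination step: one simple operation per byte.
    return bytes(p ^ q for p, q in zip(x, y))

# Decomposed: solve the two long-output subproblems independently,
# then combine them with the simple XOR step.
x = f("a")
y = f("b")
z = xor(x, y)
```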
Okay, this makes more sense now. My understanding: for question X, answer Y from the ML system, and amplification system A, the verification in your quote asks A to answer “Would A(Z) output answer Y to question X?”, as opposed to asking A to answer X and then checking whether the result equals Y. This can be at most as hard as running the original system, and could be much more efficient.
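As an analogy for why verifying a proposed answer can be much cheaper than recomputing it (a toy NP-style illustration, not the amplification setup itself): take X = “find a nontrivial factor of N”. Recomputing means factoring; verifying a proposed Y is one divisibility check.

```python
def recompute_answer(n: int) -> int:
    # Recompute from scratch: trial division, expensive in general.
    for d in range(2, n):
        if n % d == 0:
            return d
    raise ValueError("n is prime")

def verify_answer(n: int, y: int) -> bool:
    # Verify the proposed answer: a single modulus operation.
    return 1 < y < n and n % y == 0
```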
From the COVID Symptom Study in the UK (app-based questionnaire), “10 per cent of those taking part in the survey had symptoms of long Covid for a month, with between 1.5 and 2 per cent still experiencing them after three months”, and they claim “long Covid is likely a bigger issue than excess deaths as a result of Covid, which are between 0.5 per cent and 1 per cent”.
App-based survey, so not necessarily representative of the population. Not clear how severe the 3-month cases are, though they state “The most common reported symptom has been described by doctors as ‘profound fatigue’”. The article also summarizes other related studies.
Right, but in the post the implicitly represented Z is used by an amplification or debate system, because it contains more information than a human can quickly read and use (so are you assuming it’s simple to verify the results of amplification/debate systems?)
for extremely large Z which are represented only implicitly as in Paul’s post, we might not always check whether the model matches the ground truth by actually generating the ground truth and instead just ask the human to verify the answer given Z
I’m not sure what “just ask the human to verify the answer given Z” looks like, for implicitly represented Z
I’m skeptical of this.
Wuhan needed 2 months on lockdown: https://en.wikipedia.org/wiki/COVID-19_pandemic_lockdown_in_Hubei
I’d expect that imposing China-style lockdowns in the West would require significant force and might end up causing a large-scale panic in and of itself.
I’d expect that any lockdown in the West wouldn’t have been effective enough to stamp out 100% of cases, and if you don’t eradicate the virus you need ongoing measures or it will just flare up again later, so one strictly enforced lockdown wouldn’t cut it. (Though maybe you could do very rigorous contact tracing and lock down just the people who might have been in contact with cases, which could be less costly than a full lockdown but would probably still need significant enforcement.)
https://www.microcovid.org/paper/2-riskiness#fn6 discusses https://covid.joinzoe.com/post/covid-long-term, which has an app-based survey claiming 1 in 10 people still have symptoms after 3 weeks (but since people can just sign up for the app, I’d guess this is harder to interpret than the telephone survey). Microcovid.org uses this 1-in-10 figure as the estimate of the chance of some ongoing health consequence, and claims the risk of ongoing health problems from a 1% chance of COVID is equivalent to the risk from 1 year of driving (but this comparison involves even more assumptions).
https://www.cdc.gov/mmwr/volumes/69/wr/mm6930e1.htm found that ~1 in 5 of 18-34 year olds with no underlying health conditions had symptoms 3 weeks later (telephone survey of people who’d been symptomatic and had a positive test).
Other discussion in comments of https://www.lesswrong.com/posts/ahYxBHLmG7TiGDqxG/do-we-have-updated-data-about-the-risk-of-permanent-chronic
Not addressing fatigue, and just a study in progress, but this study is looking for long-term neurological problems; it might be another weak bit of evidence when it releases results: https://www.cambridgebrainsciences.com/studies/covid-brain-study
Seems like it could be helpful if people who’ve thought about this would also predict on the question of what the survey value would be today (e.g. via Elicit snapshots).
Google’s Meena (2.6 billion parameters, February 2020) creates original puns in “Conversation D”, and I think “Cross-turn Repetition Example 2” is absurdist comedy, though maybe more as a result of the way the model fails.
Human: I like how cute cows are
Meena: Ah, interesting. You like how they are cute, right?
Human: yeah, and they’re realy smart
Meena: I heard they go to college
Human: cows go to college?
Meena: I heard that a cow went to Harvard.
Human: what did the cow study?
Meena: Bovine sciences.
Human: do horses go to Harvard?
Meena: Horses go to Hayvard.
Human: that’s a pretty good joke, I feel like you led me into it
Meena: You were trying to steer it elsewhere, I can see it.