Safety researcher at OpenAI.
Views are my own and not those of my employer.
IMO Eleuther should probably spend more time doing things like this and less on scaling LMs
It seems to me like this should be pretty easy to do and I’m disappointed there hasn’t been more action on it yet. Things I’d try:
- reach out to various human-data-as-a-service companies like SurgeHQ, Scale, Samasource
- look for people on Upwork
- find people who write fiction on the internet (e.g. post on fanfiction forums) and offer to pay them to annotate their existing stories (not a dungeon run exactly, but I don’t see why the dungeon setting is important)
I’d be interested to hear if anyone has tried these things and run into roadblocks.
I’m also interested if anyone has an explanation of why the focus is on the dungeon thing in particular rather than e.g. fiction generally.
One concern I’d have with this dataset is that the thoughts are post-hoc rationalizations for what is written rather than the actual thought process that went into it. To reduce this, you could do something like split it so one person writes the thoughts, and someone else writes the next step, without other communication.
Seems like a simplicity prior over explanations of model behavior is not the same as a simplicity prior over models? E.g. simplicity of explanation of a particular computation is a bit more like a speed prior. I don’t understand exactly what’s meant by explanations here. For some kinds of attribution, you can definitely have a simple explanation for a complicated circuit and/or long-running computation, e.g. if under a relevant input distribution, one input almost always determines the output of a complicated computation.
crossposting my comments from Slack thread:
Here are some debate trees from experiments I did on long-text QA on this example short story:
Debater view 1
Debater view 2
Our conclusion was that we don’t expect debate to work robustly in these cases. In our case this was mostly because in cases where the debate is things like ‘is there implied subtext A?’, human debaters don’t really know why they believe some text does or doesn’t have a particular implication. They have some mix of priors about what the text might be saying (which can’t really be justified with debate), and various updates to that based on style, word choice, etc, where humans don’t necessarily have introspective access to what exactly in the text made them come to the conclusion.
My guess is that’s not the limitation you’re running into here; I’d expect that to just be the depth.
There are other issues with text debates, like if the evidence is distributed across many quotes that each only provide a small amount of evidence—in this case the honest debater needs to have decent estimates for how much evidence each quote provides, so they can split their argument into something like ‘there are 10 quotes that weakly support position A’; ‘the evidence that these quotes provide is additive rather than redundant’.
[edited to fix links]
I think I’m something like 30% on ‘The highest-leverage point for alignment work is once we have models that are capable of alignment research—we should focus on maximising the progress we make at that point, rather than on making progress now, or on making it to that point—most of the danger comes after it’
Things this maybe implies:
We should try to differentially advance models’ ability to do alignment research relative to other abilities (abilities required to be dangerous, or abilities required to accelerate capabilities)
For instance, trying to make really good datasets related to alignment, e.g. by paying humans to proliferate/augment all the alignment research and writing we have so far
Figuring out what combination of math/code/language/arxiv etc seem to be the most conducive to alignment-relevant capabilities
More generally, researching how to develop models that are strong in some domains and handicapped in others
We should focus on getting enough alignment to extract the alignment research capabilities
This might mean we only need to align:
models that are not agentic/not actively trying to deceive you
models that are subhuman in many domains
If we think these models are going to be close to having agency, maybe we want to avoid RL or other finetuning that incentivizes the model to think about its environment/human supervisors. Instead we might want to use some techniques that are more like interpretability or extracting latent knowledge from representations, rather than RLHF?
We should think about how we can use powerful models to accelerate alignment
We should focus more on how we would recognise good alignment research as opposed to producing it
For example, setups where you can safely train a fairly capable model according to some proposed alignment scheme, and see how well it works?
You might think that humans are more robust on the distribution of [proposals generated by humans trying to solve alignment] vs [proposals generated by a somewhat superhuman model trying to get a maximal score]
IMO, the alignment MVP claim Jan is making is approximately ‘we only need to focus on aligning narrow-ish alignment research models that are just above human level, which can be done with RRM (and maybe some other things, but no conceptual progress?)’ and requires:
we can build models that are:
Not dangerous themselves
capable of alignment research
We can use RRM to make them aligned enough that we can get useful research out of them.
We can build these models before [anyone builds models that would be dangerous without [more progress on alignment than is required for aligning the above models]]
We have these models for long enough before danger and/or the models speed up alignment progress by enough that the alignment progress made during this time is comparably large to or larger than the progress made up to that date.
I’d imagine some cruxes to include:
- whether it’s possible to build models capable of somewhat superhuman alignment research that do not have inner agents
- whether people will build systems that require conceptual progress in alignment to make safe before we can build the alignment MVP and get significant work out of it
As written there, the strong form of the orthogonality thesis states ‘there’s no extra difficulty or complication in creating an intelligent agent to pursue a goal, above and beyond the computational tractability of that goal.’
I don’t know whether that’s intended to mean the same as ‘there are no types of goals that are more ‘natural’ or that are easier to build agents that pursue, or that you’re more likely to get if you have some noisy process for creating agents’.
I feel like I haven’t seen a good argument for the latter statement, and it seems intuitively wrong to me.
Yeah, I’m particularly worried about the second comment/last paragraph: people not actually wanting to improve their values, or only wanting to improve them in ways we think are not actually an improvement (e.g. wanting to have purer faith)
Random small note—the ‘dungeon’ theme is slightly …culturally offputting? or something for me, as someone who’s never been into this kind of thing or played any of these and is therefore a bit confused about what exactly this involves, and has vague negative associations (I guess because dungeons sound unpleasant?). I wonder if something a bit blander like a story, play, or AI assistant setting could be better?
Someone who wants to claim the bounty could just buy the dataset from one of the companies that does this sort of thing, if they’re able to produce a sufficiently high-quality version, I assume? Would that be in the spirit of the bounty?
Not sure what you mean by ‘Hobbesian state of nature founding assumptions’, although I’ll admit I’m pretty sympathetic to the Hobbesian view. You mean the claim about most creatures living in a Malthusian struggle? Do you think that’s not true of non-human animals, or of humans prior to the availability of birth control? Or is your claim more like there’s something about humans that should be viewed as a stable trend away from Malthusianism, not an anomaly?
some relevant ideas here maybe: https://reducing-suffering.org/what-are-suffering-subroutines/
I guess I expect there to be a reasonable amount of computation taking place, and it seems pretty plausible a lot of these computations will be structured like agents who are taking part in the Malthusian competition. I’m sufficiently uncertain about how consciousness works that I want to give some moral weight to ‘any computation at all’, and reasonable weight to ‘a computation structured like an agent’.
I think if you have Malthusian dynamics you *do* have evolution-like dynamics.
I assume this isn’t a crux, but fwiw I think it’s pretty likely most vertebrates are moral patients
thanks, edited :)
It sounds like you’re implying that you need humans around for things to be dystopic? That doesn’t seem clear to me; the AIs involved in the Malthusian struggle might still be moral patients