Thank you for the suggestion!
Thanks for your engagement with the post. I’m not quite sure I understand what you’re getting at? Please could you elaborate?
Thank you for partaking!
Your linked experiment looks very interesting; I will give it a read. Thank you for the heads up.
Will you randomize (some of) your choices, as dynomight suggests?
We’re not going to randomise choices. The symmetry of the sorts of actions being chosen, combined with the fact that the market both makes the decision and mentors are trading on it (as suggested by Hanson), means we shouldn’t suffer from any of the weird oddities that decision markets can, in theory, occasionally suffer from.
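For anyone unfamiliar with the mechanism, here is a minimal sketch of the selection rule as I understand it, assuming a simple “pick the option with the highest conditional forecast” scheme; the idea names and forecast values below are hypothetical, and this is illustrative rather than our exact implementation.

```python
# Minimal sketch of a decision-market selection rule, assuming the simple
# "pick the option with the highest conditional forecast" scheme. The option
# names, metric, and prices here are hypothetical.

def select_by_decision_market(conditional_forecasts: dict[str, float]) -> tuple[str, list[str]]:
    """Pick the option whose conditional market predicts the best outcome.

    conditional_forecasts maps each candidate research idea to the market's
    forecast of the chosen metric (e.g. predicted post karma) *conditional on*
    that idea being pursued. The winning idea is selected; the other
    conditional markets would resolve N/A, since their condition never obtains.
    """
    chosen = max(conditional_forecasts, key=conditional_forecasts.get)
    resolve_na = [idea for idea in conditional_forecasts if idea != chosen]
    return chosen, resolve_na


if __name__ == "__main__":
    forecasts = {          # hypothetical conditional forecasts of post karma
        "idea_A": 62.0,
        "idea_B": 48.5,
        "idea_C": 55.0,
    }
    chosen, resolve_na = select_by_decision_market(forecasts)
    print(f"pursue {chosen}; resolve {resolve_na} as N/A")
```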
I’ve ended up making another post somewhat to this effect, trying to predict any significant architectural shifts over the next year and a half: https://manifold.markets/Jasonb/significant-advancement-in-frontier
I made a manifold post for this for those who wish to bet on it: https://manifold.markets/JasonBrown/will-a-gpt4-level-efficient-hrm-bas?r=SmFzb25Ccm93bg
Thank you for providing a good introduction and arguments in favour of this research direction. Whilst I strongly agree with the idea of safety pre-training being valuable (and have even considered working on it myself with some collaborators), I think there are several core claims here that are false and that ultimately one should not consider alignment to be solved.
1. RL is not as bad as you make it out to be. The short version is that RL is about reinforcing good behaviour that has been sampled from the model. The model does not have access to the reward signal directly, and is not engaging in planning or search to maximise received reward during the RL process by default (though it could learn to start doing this if it was sufficiently intelligent, self-aware, and trained via RL for long enough). The longer version is here and here; see also the toy sketch after this list.
2. The output distribution of an SFT’d model is not the training distribution, even with cross-entropy loss, unless you’re training on non-adversarial data and sampling the model with no conditioning. Data poisoning attacks can happen and influence outputs much more strongly than would be expected from their proportion of the training data. When you prompt an LLM to be a capable chat or agentic model, you’re already moving quite far out of its normal training distribution, and so you cannot make good predictions about how it will behave based on the overall proportions of good / bad training data and its annotations.
3. A lot of my probability mass for AIs doing bad things comes from a mixture of out-of-context reasoning and situational awareness leading to unexpectedly intelligent behaviour. Models have been shown to be capable of both. I’d predict both of these (especially co-occurring) would degrade the extent to which an alignment token / conditioning would work.
4. There might be other capabilities, like the aforementioned ones, that models acquire on the way to stronger performance and that interact with this approach in a negative way.
5. I don’t think this actually addresses inner alignment as effectively as you imagine. In the situation you’re considering, where you prompt the model with this alignment conditioning, it’s not guaranteed that the model will have correctly generalised the meaning of this objective from what it saw in pre-training to being a totally aligned agent. I agree it probably helps a lot, but you still face the same old inner-alignment issues of correctly generalising something that was seen in (pre-)training to deployment, when deployment looks different from training. This is somewhat of a generalisation of points 2-4 above.
6. Whilst this helps a lot with outer alignment, I don’t think it solves it completely either. Yes, models are able to recognise and probably correctly annotate a lot of data for pre-training, but are we really confident this is going to effectively capture human values, or some coherent-extrapolated-volition thereof? Even with all annotations correct, this objective seems like “output nice, safe-looking text that a human might output” rather than “genuinely optimise for human values”. Value alignment aside, this method will probably help greatly with intent alignment, and whilst intent alignment is probably a good target for current frontier AI, it comes with many of its own problems, primarily misuse, where another large chunk of my P(doom) comes from.
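To make point 1 above a bit more concrete, here is a toy REINFORCE-style sketch (my own simplified illustration, not anything a frontier lab actually runs): the reward only ever appears as a scalar weight on the log-probability of behaviour the model itself sampled, never as an input the model can see or plan against.

```python
# Toy REINFORCE-style update on a categorical "policy". The reward is only a
# scalar weight on the log-probability of behaviour the model itself sampled;
# it is never an input to the model. A sketch under simplified assumptions.
import torch

logits = torch.zeros(4, requires_grad=True)       # toy policy over 4 "actions"
optimizer = torch.optim.SGD([logits], lr=0.1)

def reward_fn(action: int) -> float:
    """Stand-in for a reward model / preference signal (hypothetical)."""
    return 1.0 if action == 2 else 0.0

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                         # behaviour sampled from the model
    reward = reward_fn(action.item())              # scored externally, after sampling
    loss = -reward * dist.log_prob(action)         # reinforce what was sampled
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=-1))               # mass shifts toward the rewarded action
```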
I also want to generalise points 3, 4, 6, and what Steven Byrnes is claiming into: Training a model to act like an aligned human-level intelligence is not the same as training a model to act like an aligned super-intelligence, and whatever we do to raise capabilities here may also alter or break alignment, and so cannot be relied upon.
TL;DR I think safety pre-training is probably a huge boost to alignment, but our work is far from done and there are still lots of issues / uncertainties.
Thank you!
Your post was also very good and I agree with its points. I’ll probably edit my post in the near future to reference it along with some of the other good references your post had that I wasn’t aware of.
Yes, it’s still unclear how to measure modification magnitude in general (or whether that’s even possible to do in a principled way), but for modifications that are limited to text you could use the entropy of the text, which seems to me like a fairly reasonable and somewhat fundamental measure (in the information-theoretic sense). Thank you for the references in your other comment, I’ll make sure to give them a read!
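To gesture at what I mean, here is a tiny sketch of that kind of measure: character-level empirical entropy times length, i.e. a crude proxy for the description length of the modification text. A more principled version would use a strong compressor or a language model, and the example strings below are made up.

```python
# Sketch of "entropy of the modification text" as a magnitude measure:
# empirical character-level entropy (bits per character) times length,
# a rough stand-in for the description length of the modification.
import math
from collections import Counter

def modification_bits(text: str) -> float:
    """Approximate information content of a text modification, in bits."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    entropy_per_char = -sum(
        (c / total) * math.log2(c / total) for c in counts.values()
    )
    return entropy_per_char * total

print(modification_bits("be 5% more curious"))                          # small edit, few bits
print(modification_bits("adopt an entirely new set of goals and values"))  # larger edit, more bits
```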
Thank you, this looks very interesting
Ahh I see what you mean now, thank you for the clarification.
I agree that in general people trying to exploit and Goodhart LW karma would be bad, though I hope the experiment won’t contribute to this. Here, post karma is only being used as a measure, not as a target. The mentors and mentees gain nothing beyond what any other person would normally gain from their research project resulting in a highly-upvoted LW post. Predicted future post karma is just being used to optimise over research ideas, and the space of ideas itself is very small (in this experiment), so I doubt we’ll get any serious Goodharting through selection of ideas that aren’t very good research but are likely to produce particularly memetic LW posts (and even then, this is part of the motivation for having several metrics, so that none gets too specifically optimised for).
There is perhaps an argument that those who predicted a post would get high karma might want to manipulate it upwards to make their prediction come true, but those who predicted it would be lower have the opposite incentive. Regardless, that kind of manipulation is, I think, quite strictly prohibited by both LW and Manifold guidelines, and anyone caught doing it in a serious way would likely be severely reprimanded. In the worst case, if any of the metrics are seriously and obviously manipulated in a way that cannot be rectified, the relevant markets will be resolved N/A, though I think the probability of this happening is extremely low.
All that said, I think it is important to think about what more suitable / better metrics would be, if research futarchy were to become more common. I can certainly imagine a world where widespread use of LW post karma as a proxy for research success could have negative impacts on the LW ecosystem, though I hope by then there will have been more development and testing of robust measures beyond our starting point (which, for the record, I think is already somewhat robust).