Jason R Brown

Karma: 379

Jason R Brown 18 Jun 2026 14:26 UTC
3 points
−3
on: Jason R Brown’s Shortform
I really want to write more LessWrong posts and I have a few ideas / things sketched out. I thought it might be fun to use Manifold to allow people to bet on how well they might do or whether I’ll get round to writing them: https://manifold.markets/Jasonb/how-interesting-are-my-different-id.

Jason R Brown 4 Jun 2026 14:08 UTC
1 point
0
on: Taking the Training Wheels Off: Aligning LLMs without Personas
Interesting post!

To what extent do you think this being useful / important is correlated with the Natural Abstraction Hypothesis? This feels like the crux to me.

If some version of NAH is correct, then maybe desirable personas cluster around the natural form of goodness / alignment we desire, and so extrapolating from them will likely be very useful. It might even be that the ways in which they don’t cluster around this are correctable in some natural way that still makes personas a useful starting point.

However, if NAH doesn’t hold, or at least doesn’t hold between humans/personas and superintelligences, then it does seem like personas are much less useful and are very unlikely to meaningfully capture / guide ASI towards the target we want.

Jason R Brown 15 Apr 2026 19:13 UTC
1 point
0
on: From personas to intentions: towards a science of motivations for AI models
I think this is very relevant if you’ve not already seen it: https://arxiv.org/abs/2406.06560

Jason R Brown 23 Oct 2025 12:14 UTC
2 points
0
on: A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring
Another relevant market predicting in what years CoT monitoring will not work: https://manifold.markets/Jasonb/in-what-years-will-cot-monitoring-f?r=SmFzb25i

Jason R Brown 16 Oct 2025 10:57 UTC
1 point
0
on: AI Safety Research Futarchy: Using Prediction Markets to Choose Research Projects for MARS
UPDATE:

The projects have been chosen! They are:
- Cameron Tice: Goal Crystallisation
- Puria Radmard & Shi Feng: Exploring more meta-cognitive capabilities of LLMs
- Lennie Wells: Model organisms resisting generalisation
These markets will be left locked until their individual metrics are resolvable; all other markets for the un-chosen projects will be resolved N/A.

Thank you to everyone who traded on these markets, and special thanks to those who provided feedback about the research projects and the futarchy experiment itself.

Jason R Brown 5 Oct 2025 16:57 UTC
2 points
0
in reply to: rotatingpaguro’s comment on: AI Safety Research Futarchy: Using Prediction Markets to Choose Research Projects for MARS
Ahh I see what you mean now, thank you for the clarification.

I agree that in general people trying to exploit and Goodhart LW karma would be bad, though I hope the experiment would not contribute to this. Here, post karma is only being used as a measure, not as a target. The mentors and mentees gain nothing beyond what any other person would normally gain by their research project resulting in a highly-upvoted LW post. Predicted future post karma is just being used to optimise over research ideas, and the space of ideas itself is very small (in this experiment), so I doubt we’ll get any serious Goodharting through selecting ideas that are perhaps not very good research but likely to produce particularly memetic LW posts (and even then, this is part of the motivation for having several metrics, so that none gets too specifically optimised for).

There is perhaps an argument that those who have predicted a post would get high karma might want to manipulate it up to make their prediction come true, but those who predicted it would be lower have the opposite incentive. Regardless of that, that kind of manipulation is I think quite strictly prohibited by both LW and Manifold guidelines, and anyone caught doing it in a serious way would likely be severely reprimanded. In the worst case, if any of the metrics are seriously and obviously manipulated in a way that cannot be rectified, the relevant markets will be resolved N/A, though I think this happening is extremely low probability.

All that said, I think it is important to think about what more suitable / better metrics would be, if research futarchy was to become more common. I can certainly imagine a world where widespread use of LW post karma as a proxy for research success could have negative impacts on the LW ecosystem, though I hope by then there will have been more development and testing of robust measures beyond our starting point (which, for the record, I think is somewhat robust already).

Jason R Brown 3 Oct 2025 16:25 UTC
1 point
0
in reply to: danielms’s comment on: AI Safety Research Futarchy: Using Prediction Markets to Choose Research Projects for MARS
Thank you for the suggestion!

Jason R Brown 3 Oct 2025 16:23 UTC
2 points
0
in reply to: rotatingpaguro’s comment on: AI Safety Research Futarchy: Using Prediction Markets to Choose Research Projects for MARS
Thanks for your engagement with the post. I’m not quite sure I understand what you’re getting at? Please could you elaborate?

Jason R Brown 30 Sep 2025 21:59 UTC
2 points
0
in reply to: niplav’s comment on: AI Safety Research Futarchy: Using Prediction Markets to Choose Research Projects for MARS
Thank you for partaking!

Your linked experiment looks very interesting, I will give it a read, thank you for the heads up.

Will you randomize (some of) your choices, as dynomight suggests?

We’re not going to randomise choices: the symmetry of the sorts of actions being chosen, combined with the fact that the market makes the decision and that mentors are trading on the markets (as suggested by Hanson), means we shouldn’t suffer any of the oddities that decision markets can theoretically suffer from.

Jason R Brown 15 Aug 2025 9:12 UTC
5 points
0
in reply to: otto.barten’s comment on: I am worried about near-term non-LLM AI developments
I’ve ended up making another post somewhat to this effect, trying to predict any significant architectural shifts over the next year and a half: https://manifold.markets/Jasonb/significant-advancement-in-frontier

Jason R Brown 31 Jul 2025 16:00 UTC
44 points
2
on: I am worried about near-term non-LLM AI developments
I made a Manifold post for this for those who wish to bet on it: https://manifold.markets/JasonBrown/will-a-gpt4-level-efficient-hrm-bas?r=SmFzb25Ccm93bg

Jason R Brown 9 Jun 2025 9:21 UTC
1 point
0
on: The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?
Thank you for providing a good introduction and arguments in favour of this research direction. Whilst I strongly agree with the idea of safety pre-training being valuable (and have even considered working on it myself with some collaborators), I think there are several core claims here that are false and that ultimately one should not consider alignment to be solved.
1. RL is not as bad as you make it out to be. The short version is that RL is about reinforcing good behaviour that has been sampled from the model. The model does not have access to the reward signal directly, and is not engaging in planning or search to maximise received reward during the RL process by default (though it could learn to start doing this if it was sufficiently intelligent, self-aware, and trained via RL long enough). The longer version is here and here.
2. The output distribution of an SFT’d model is not the training distribution, even with cross-entropy loss, unless you’re training on non-adversarial data and sampling the model with no conditioning. Data poisoning attacks can happen and influence outputs much more strongly than would be expected by the proportion of them in the training data. When you prompt an LLM to be a capable chat or agentic model, you’re already moving quite far out of its normal training distribution, and so you cannot make good predictions on how it will behave based on general proportions of good / bad training data and its annotations.
3. A lot of my probability mass for AIs doing bad things comes from a mixture of out-of-context reasoning and situational awareness leading to unexpectedly intelligent behaviour. Models have been shown to be capable of doing these. I’d predict both of these (especially co-occurring) would degrade the extent to which an alignment token / conditioning would work.
4. There might be other capabilities models acquire on the way to stronger performance like the aforementioned ones that interact with this in a negative way.
5. I don’t think this actually addresses inner alignment as effectively as you imagine. I think in the situation you’re considering where you prompt the model with this alignment conditioning, it’s not guaranteed that the model will have correctly generalised the meaning of this objective from what it saw in pre-training to being a totally aligned agent. I agree it probably helps a lot, but you still face the same-old inner-alignment issues of correctly generalising something that was seen in (pre-)training to deployment, when deployment looks different to training. This is somewhat of a generalisation of points 2-4 above.
6. Whilst this helps a lot with outer-alignment, I don’t think it solves that completely either. Yes, models are able to recognise and probably correctly annotate a lot of data for pre-training, but are we really confident this is going to effectively capture human values, or some coherent-extrapolated-volition thereof? Even with all annotations correct, this objective seems like “output nice, safe-looking text that a human might output” and not “genuinely optimise for human values”. Value alignment aside, it’s probable this method will greatly help with intent alignment, and whilst intent alignment is probably a good target for current frontier AI, it comes with many of its own problems, primarily misuse, where another large chunk of my P(doom) comes from.
7. I also want to generalise points 3, 4, 6, and what Steven Byrnes is claiming into: Training a model to act like an aligned human-level intelligence is not the same as training a model to act like an aligned super-intelligence, and whatever we do to raise capabilities here may also alter or break alignment, and so cannot be relied upon.
TL;DR I think safety pre-training is probably a huge boost to alignment, but our work is far from done and there are still lots of issues / uncertainties.

Jason R Brown 15 Apr 2025 12:16 UTC
1 point
0
in reply to: Seth Herd’s comment on: Breaking down the MEAT of Alignment
Thank you!

Your post was also very good and I agree with its points. I’ll probably edit my post in the near future to reference it along with some of the other good references your post had that I wasn’t aware of.

Jason R Brown 18 Jun 2022 20:36 UTC
1 point
0
in reply to: lorepieri’s comment on: Quantifying General Intelligence
Yes, it’s still unclear how to measure modification magnitude in general (or whether that’s even possible to do in a principled way), but for modifications which are limited to text, you could use the entropy of the text, which to me seems like a fairly reasonable and somewhat fundamental measure (according to information theory). Thank you for the references in your other comment, I’ll make sure to give them a read!
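
To make the entropy suggestion concrete, here is a minimal sketch in Python. It assumes a character-level empirical (unigram) Shannon entropy, which is only one possible estimator; the comment above does not say how the entropy would be estimated, and a compression-based or language-model-based estimate could be swapped in. The function names `char_entropy_bits` and `modification_size_bits` are hypothetical, chosen for illustration.

```python
import math
from collections import Counter


def char_entropy_bits(text: str) -> float:
    """Empirical Shannon entropy per character, in bits.

    Assumption: characters are treated as independent draws from the
    text's own character frequencies (a unigram model).
    """
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    # H = -sum_x p(x) * log2 p(x), with p(x) the observed frequency of x.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def modification_size_bits(text: str) -> float:
    """Rough size of a text modification: length times per-character entropy."""
    return len(text) * char_entropy_bits(text)


if __name__ == "__main__":
    for sample in ["aaaa", "abab", "abcd"]:
        print(repr(sample), modification_size_bits(sample))
```

Under this estimator a highly repetitive modification scores close to zero however long it is, while a modification that uses many distinct characters evenly scores higher. That is the intended behaviour of an information-theoretic size measure, though a unigram estimate ignores structure between characters.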

Jason R Brown 18 Jun 2022 7:24 UTC
1 point
0
in reply to: Charlie Steiner’s comment on: Quantifying General Intelligence
Thank you, this looks very interesting