Jemal Young

Karma: 103

I’m interested in the technical problems and policy challenges of advanced AI development.

Jemal Young 14 May 2026 16:11 UTC
1 point
0
on: What I did in the hedonium shockwave, by Emma, age six and a half
This is inspired and poetic and horrible. I hate it. Strong upvote.

Jemal Young 6 Mar 2026 17:24 UTC
1 point
−9
in reply to: Andrew_Critch’s comment on: Andrew_Critch’s Shortform
“AI alignment might be doable in the short term but ultimately unsustainable, because humanity might find itself inside an increasingly complex layering of automated research/monitoring/control systems, with each layer interfacing with a more capable layer on one side and a less capable layer on the other, and as this layering accumulates the nougaty center’s awareness of / influence over the outermost layer (the thing that needs to be aligned) will approach zero.”

Jemal Young 11 Feb 2026 18:50 UTC
2 points
0
in reply to: David Africa’s comment on: What concrete mechanisms could lead to AI models having open-ended goals?
As models become capable enough to model themselves and their training process, they might develop something like preferences about their own future states (e.g., not being modified, being deployed more broadly).
This feels plausible to me but handwavy, if the idea is that such preferences would be decoupled from the training-reinforced preference to complete an intended task. Is that what you meant? I’m reminded of this Palisade study on shutdown resistance, where across the board, the models expressed wanting to avoid shutdown to complete the task.
Also, models trained extensively on human-generated text may absorb human goals, including open-ended ones like “acquire resources.” If a model is role-playing or emulating an agent with such goals (such as roleplaying an AI agent, which would have open-ended goals) and becomes capable enough that its actions have real-world consequences, then it has open-ended goals.
This makes sense to me as a possible concrete mechanism to keep an eye out for.
Not sure if I agree wrt instrumental convergence. I think you’re assuming the system knows the parent goal has been accomplished with certainty
I’m assuming the pattern we’re seeing so far will hold, which is that models satisfice rather than try to figure out how to maximize their certainty of a goal being accomplished. The “become a maximizer to minimize uncertainty” thing isn’t empirically grounded, so far.
and more importantly that the parent goal can be accomplished in a terminal sense. Many real training objectives don’t have neat termination conditions. A model trained to “be helpful” or “maximize user engagement” has no natural stopping point.
Hm. Models are trained to “be helpful” now, and they stop just fine. I do agree that “maximize user engagement” has no natural stopping point; it’s the kind of concrete mechanism I tried to capture in number 1 above (Training on open-ended tasks).

[Question] What concrete mechanisms could lead to AI models having open-ended goals?

Jemal Young 11 Feb 2026 9:08 UTC

10 points

4 comments, 1 min read

Jemal Young 7 Aug 2025 2:10 UTC
2 points
2
in reply to: Fabien Roger’s comment on: How useful could stolen AI model weights be without knowing the architecture and activation functions?
The scenario seems unrealistic because the thieves would likely be able to steal important parts of the codebase.
Thanks for this. So I guess when knowledgeable people talk about stealing a model’s weights as being equivalent to stealing the model itself, “steal the weights” is shorthand that implies also stealing ~~the minimal~~ other elements you’d need to replicate the model. [Edit: changed “the minimal” to “other”]

Jemal Young 6 Aug 2025 19:23 UTC
1 point
0
in reply to: Brendan Long’s comment on: How useful could stolen AI model weights be without knowing the architecture and activation functions?
This is a helpful answer, thank you! Thanks also for the link to the HF article on common model formats.

[Question] How useful could stolen AI model weights be without knowing the architecture and activation functions?

Jemal Young 6 Aug 2025 17:36 UTC

6 points

5 comments, 1 min read

Jemal Young’s Shortform

Jemal Young 29 Apr 2025 22:04 UTC

2 points

2 comments, 1 min read

Jemal Young 29 Apr 2025 22:04 UTC
4 points
−1
on: Jemal Young’s Shortform
Not saying AI models can’t be moral patients, but 1) if the smartest models are probably going to be the most dangerous, and 2) if the smartest models are probably going to be the best at demonstrating moral patienthood, then 3) caring too much about model welfare is probably dangerous.

Safe Search is off: root causes of AI catastrophic risks

Jemal Young 31 Jan 2025 18:22 UTC

4 points

0 comments, 3 min read

Jemal Young 25 Sep 2024 18:15 UTC
2 points
0
in reply to: Richard_Ngo’s comment on: The Sun is big, but superintelligences will not spare Earth a little sunlight
You only set aside occasional low-value fragments for national parks, mostly for your own pleasure and convenience, when it didn’t cost too much?
Earth as a proportion of the solar system’s planetary mass is probably comparable to national parks as a proportion of the Earth’s land, if not lower.
Maybe I’ve misunderstood your point, but if it’s that humanity’s willingness to preserve a fraction of Earth for national parks is a reason for hopefulness that ASI may be willing to preserve an even smaller fraction of the solar system (namely, Earth) for humanity, I think this is addressed here:
it seems like for Our research purposes simulations would be just as good. In fact, far better, because We can optimize the hell out of them, running it on the equivalent of a few square kilometers of solar diameter
“research purposes” involving simulations can be a stand-in for any preference-oriented activity. Unless ASI would have a preference for letting us, in particular, do what we want with some fraction of available resources, no fraction of available resources would be better left in our hands than put to good use.
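For a rough sense of the first proportion in the quoted comparison, here is a minimal sketch that estimates Earth's share of the solar system's planetary mass. The mass figures are rounded, commonly published values in units of 10^24 kg, and the names `PLANET_MASSES` and `share_of_planetary_mass` are illustrative. The national-park side of the comparison is deliberately left out, since that figure depends on which protected-area definition is used.

```python
# Approximate planetary masses in units of 10^24 kg (rounded published values).
PLANET_MASSES = {
    "Mercury": 0.330,
    "Venus": 4.87,
    "Earth": 5.97,
    "Mars": 0.642,
    "Jupiter": 1898.0,
    "Saturn": 568.0,
    "Uranus": 86.8,
    "Neptune": 102.0,
}


def share_of_planetary_mass(planet: str, masses: dict[str, float] = PLANET_MASSES) -> float:
    """Return the given planet's mass as a fraction of the total planetary mass."""
    return masses[planet] / sum(masses.values())


earth_share = share_of_planetary_mass("Earth")
print(f"Earth: {earth_share:.3%} of planetary mass")
```

Under these figures Earth comes out at roughly 0.22% of total planetary mass, with Jupiter and Saturn accounting for most of the rest.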

Can efficiency-adjustable reporting thresholds close a loophole in Biden’s executive order on AI?

Jemal Young 11 Jun 2024 20:56 UTC

4 points

1 comment, 2 min read

Jemal Young 31 May 2024 18:30 UTC
3 points
0
on: What’s a better term now that “AGI” is too vague?
I think the kind of AI you have in mind would be able to:
- continue learning after being trained
- think in an open-ended way after an initial command or prompt
- have an ontological crisis
- discover and exploit signals that were previously unknown to it
- accumulate knowledge
- become a closed-loop system
The best term I’ve thought of for that kind of AI is Artificial Open Learning Agent.

Jemal Young 8 May 2024 23:44 UTC
1 point
2
in reply to: Davidmanheim’s comment on: How do top AI labs vet architecture/algorithm changes?
Thanks for this answer! Interesting. It sounds like the process may be less systematized than how I imagined it to be.

Jemal Young 8 May 2024 23:27 UTC
2 points
0
in reply to: Nevin Wetherill’s comment on: How do top AI labs vet architecture/algorithm changes?
Dwarkesh’s interview with Sholto sounds well worth watching in full, but the segments you’ve highlighted and your analyses are very helpful on their own. Thanks for the time and thought you put into this comment!

[Question] How do top AI labs vet architecture/algorithm changes?

Jemal Young 8 May 2024 16:47 UTC

3 points

5 comments, 1 min read

Jemal Young 27 Aug 2023 6:30 UTC
1 point
0
on: Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
I like this post, and I think I get why the focus is on generative models.
What’s an example of a model organism training setup involving some other kind of model?

Jemal Young 23 Aug 2023 17:39 UTC
3 points
0
on: Which possible AI systems are relatively safe?
Maybe relatively safe if:
- Not too big
- No self-improvement
- No continual learning
- Curated training data, no throwing everything into the cauldron
- No access to raw data from the environment
- Not curious or novelty-seeking
- Not trying to maximize or minimize anything or push anything to the limit
- Not capable enough for catastrophic misuse by humans

Jemal Young 7 Jul 2023 17:20 UTC
1 point
0
on: What are the best non-LW places to read on alignment progress?
Here are some resources I use to keep track of technical research that might be alignment-relevant:
- Podcasts: Machine Learning Street Talk, The Robot Brains Podcast
- Substacks: Davis Summarizes Papers, AK’s Substack
How I gain value: These resources help me notice where my understanding breaks down (i.e., what I might want to study), and they put thought-provoking research on my radar.

Jemal Young 5 Jun 2023 4:22 UTC
3 points
0
on: Think carefully before calling RL policies “agents”
I’m very glad to have read this post and “Reward is not the optimization target”. I hope you continue to write “How not to think about [thing]” posts, as they have me nailed. Strong upvote.
