AI strategy & governance. ailabwatch.org. ailabwatch.substack.com.
Zach Stein-Perlman
[Perfunctory review to get this post to the final phase]
Solid post. Still good. I think a responsible developer shouldn’t unilaterally pause but I think it should talk about the crazy situation it’s in, costs and benefits of various actions, what it would do in different worlds, and its views on risks. (And none of the labs have done this; in particular Core Views is not this.)
List of AI safety papers from companies, 2023–2024
One more consideration against (or an important part of “Bureaucracy”): sometimes your lab doesn’t let you publish your research.
Yep, the final phase-in date was in November 2024.
Some people have posted ideas on what a reasonable plan to reduce AI risk for such timelines might look like (e.g. Sam Bowman’s checklist, or Holden Karnofsky’s list in his 2022 nearcast), but I find them insufficient for the magnitude of the stakes (to be clear, I don’t think these example lists were intended to be an extensive plan).
See also A Plan for Technical AI Safety with Current Science (Greenblatt 2023) for a detailed (but rough, out-of-date, and very high-context) plan.
Yeah. I agree/concede that you can explain why you can’t convince people that their own work is useless. But if you’re positing that the flinchers flinch away from valid arguments about each category of useless work, that seems surprising.
I feel like John’s view entails that he would be able to convince my friends that various-research-agendas-my-friends-like are doomed. (And I’m pretty sure that’s false.) I assume John doesn’t believe that, and I wonder why he doesn’t think his view entails it.
I wonder whether John believes that well-liked research, e.g. Fabien’s list, is actually not valuable or rare exceptions coming from a small subset of the “alignment research” field.
I do not.
On the contrary, I think ~all of the “alignment researchers” I know claim to be working on the big problem, and I think ~90% of them are indeed doing work that looks good in terms of the big problem. (Researchers I don’t know are likely substantially worse but not a ton.)
In particular I think all of the alignment-orgs-I’m-socially-close-to do work that looks good in terms of the big problem: Redwood, METR, ARC. And I think the other well-known orgs are also good.
This doesn’t feel odd: these people are smart and actually care about the big problem; if their work was in the “even if this succeeds it obviously wouldn’t be helpful” category, they’d want to know (and, given the “obviously,” would figure that out).
Possibly the situation is very different in academia or MATS-land; for now I’m just talking about the people around me.
Yeah, I agree sometimes people decide to work on problems largely because they’re tractable [edit: or because they’re good for safely getting alignment research or other good work out of early AGIs]. I’m unconvinced by the flinching-away or dishonesty characterization.
This post starts from the observation that streetlighting has mostly won the memetic competition for alignment as a research field, and we’ll mostly take that claim as given. Lots of people will disagree with that claim, and convincing them is not a goal of this post.
Yep. This post is not for me but I’ll say a thing that annoyed me anyway:
… and Carol’s thoughts run into a blank wall. In the first few seconds, she sees no toeholds, not even a starting point. And so she reflexively flinches away from that problem, and turns back to some easier problems.
Does this actually happen? (Even if you want to be maximally cynical, I claim presenting novel important difficulties (e.g. “sensor tampering”) or giving novel arguments that problems are difficult is socially rewarded.)
DeepSeek-V3 is out today, with weights and a paper published. Tweet thread, GitHub, report (GitHub, HuggingFace). It’s big and mixture-of-experts-y; discussion here and here.
It was super cheap to train — they say 2.8M H800-hours or $5.6M (!!).
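For context on that dollar figure, it is just the reported GPU-hours multiplied by a rental price per GPU-hour; a minimal back-of-the-envelope sketch (the per-hour rate below is implied by the two reported numbers, not something stated here):

```python
# Back-of-the-envelope check of the quoted training cost: the dollar figure
# equals GPU-hours times a per-GPU-hour rental price. The rate is implied by
# the two reported numbers above, not separately sourced.

H800_GPU_HOURS = 2.8e6    # ~2.8M H800-hours reported
QUOTED_COST_USD = 5.6e6   # ~$5.6M quoted cost

implied_rate = QUOTED_COST_USD / H800_GPU_HOURS
print(f"Implied rate: ${implied_rate:.2f} per H800 GPU-hour")  # -> $2.00
```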
It’s powerful:
It’s cheap to run:
oops thanks
Every now and then (every ~5–10 minutes, or when I look actively distracted), briefly check in (if I’m in the zone, this might just be a brief “Are you focused on what you mean to be?” from them, and a nod or “yeah” from me).
Some other prompts I use when being a [high-effort body double / low-effort metacognitive assistant / rubber duck]:
- What are you doing?
- What’s your goal?
  - Or: what’s your goal for the next n minutes?
  - Or: what should be your goal?
- Are you stuck?
- Follow-ups if they’re stuck:
  - what should you do?
  - can I help?
  - have you considered asking someone for help?
    - If I don’t know who could help, this is more like prompting them to figure out who could help; if I know the manager/colleague/friend who they should ask, I might use that person’s name.
  - Maybe you should x
  - If someone else was in your position, what would you advise them to do?
All of the founders committed to donate 80% of their equity. I heard it’s set aside in some way but they haven’t donated anything yet. (Source: an Anthropic human.)
This fact wasn’t on the internet, or at least wasn’t easily findable via Google search. Huh. I can only find Holden mentioning that 80% of Daniela’s equity is pledged.
I disagree with Ben. I think the usage that Mark is talking about is a reference to Death with Dignity. A central example (written by me) is
it would be undignified if AI takes over because we didn’t really try off-policy probes; maybe they just work really well; someone should figure that out
It’s playful and unserious but “X would be undignified” roughly means “it would be an unfortunate error if we did X or let X happen” and is used in the context of AI doom and our ability to affect P(doom).
edit: wait likely it’s RL; I’m confused
OpenAI didn’t fine-tune on ARC-AGI, even though this graph suggests they did.
Sources:
Altman said
we didn’t go do specific work [targeting ARC-AGI]; this is just the general effort.
François Chollet (in the blogpost with the graph) said
Note on “tuned”: OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.
The version of the model we tested was domain-adapted to ARC-AGI via the public training set (which is what the public training set is for). As far as I can tell they didn’t generate synthetic ARC data to improve their score.
An OpenAI staff member replied
Correct, can confirm “targeting” exclusively means including a (subset of) the public training set.
and further confirmed that “tuned” in the graph is
a strange way of denoting that we included ARC training examples in the O3 training. It isn’t some finetuned version of O3 though. It is just O3.
Another OpenAI staff member said
also: the model we used for all of our o3 evals is fully general; a subset of the arc-agi public training set was a tiny fraction of the broader o3 train distribution, and we didn’t do any additional domain-specific fine-tuning on the final checkpoint
So on ARC-AGI they just pretrained on 300 examples (75% of the 400 in the public training set). Performance is surprisingly good.
[heavily edited after first posting]
Welcome!
To me the benchmark scores are interesting mostly because they suggest that o3 is substantially more powerful than previous models. I agree we can’t naively translate benchmark scores to real-world capabilities.
See The case for ensuring that powerful AIs are controlled.