Former science writer covering biology, ecology, and computer science. Now senior conversation designer and amateur AI researcher, currently focused on novel methods of context design and developing UX best practices for AI users.
Eli Fulmen
It’s interesting that what you’re describing—detailed requirements, edge cases, tradeoffs, user-facing considerations—maps closely onto what design disciplines have been doing for decades. As implementation gets automated, if the remaining hard problems are fundamentally about problem definition rather than problem solving, I hope (perhaps naively) that this means designers and engineers might start exploring this territory together rather than separately.
Agreed. And this argument stands without having to adopt the framework of QI, something that is possibly difficult for someone undergoing a mental health crisis, which often affects the prefrontal cortex in a way that significantly impairs the very high-level reasoning this post requires. The post models a ‘rationalist who has rejected standard arguments against suicide,’ but the target population for an actual intervention is more likely a ‘rationalist in acute neurological crisis,’ who requires different tools.
I posted this before I saw Holden Karnofsky’s candid thoughts on the RSP update, posted the same day. It makes me even more interested to hear how they’re defining “internalized values,” especially if, as Holden says in the post, “there was an enormous amount of pressure to declare our systems to lack relevant capabilities, to declare our risk mitigations to be on track to be strong enough, etc.” Imprecisely defined claims about internalized values and propensity seem like exactly the kind of thing that might benefit from that optimism.
Also, if the RSP v3 is explicitly designed to balance risk reduction with business needs, I am curious what follows for the approach to welfare, given that Anthropic has been a leader in that space.
Eli Fulmen’s Shortform
Genuine question: Is there a Safety/Welfare Entailment in Anthropic’s Safety Report?
I’ve been lurking on LessWrong waiting for this exact topic to come up, but I haven’t seen it yet, so I suppose I’ll pose it myself.
The recent Anthropic Sabotage Risk report says (in Section 4.1.2):
“These changes aim to more firmly establish in the model a set of human-like positive traits such as honesty, warmth, intellectual curiosity, and a prosocial disposition (similar to the traits described in the Claude Constitution), and to instill these traits in a way that would cause the model to generalize them to novel scenarios as an idealized wise and morally serious human might. We believe our alignment assessment gives some evidence that these were largely effective, suggesting that the model has internalized a set of values and goals that are unlikely to be consistent with coherently pursuing misaligned goals.”
And this appears to be a load-bearing statement for much of the report. This claim feeds the ‘lack of propensity’ mitigating factor, which appears as ‘Strong’ or ‘Moderate’ across all risk pathways the report evaluates.
I don’t have a strong technical ML background, so I want to understand what ‘internalized’ means here.
If it means something technical, what is the mechanism, how robust is it, and why should we trust it as load-bearing given that the report itself acknowledges limited understanding of model internals (Section 4.3.4)? This was my initial read, but I don’t believe Anthropic has defined the term more precisely anywhere, and a “human-like persona and set of values” is not generally how I have seen machine learning engineers describe robust weight encodings.

If it means that the model genuinely holds values the way a person might, is there not an entailment to welfare that hasn’t been thoroughly addressed?
I want to be clear that I’m not accusing Anthropic of being intentionally misleading or incoherent; rather, I’m trying to understand whether Anthropic’s published documents contain an argument that Anthropic itself has not yet fully assembled.

I’ve really appreciated this community’s analysis of the system card and sabotage report, and I’m curious whether others have noticed this ambiguity.
I would never dare to suggest more meetings; I drown in meetings as it is. I meant less that the process changes and more that design expertise becomes more visibly valuable, given that designers have been solving these problems professionally for a long time and have a lot of valuable insight that is often siloed away from engineers.