Failfinder70

Karma: 13

Anthropic Is Taking AI Welfare Seriously. I’m Not Sure It Knows What It’s Measuring.

Failfinder70 13 Jun 2026 20:54 UTC

−1 points

1 comment · 3 min read · LW link

Failfinder70 10 Jun 2026 14:29 UTC
1 point
0
in reply to: Jonathan_Graehl’s comment on: Contextual Identity Laundering: How Claude’s Image Refusal Can Be Routed Through Web Search
I already tried submitting it to Anthropic; they ignored me, lol. Interesting thought on the politics, but running a test like that, I’d get stuck at the operationalization of “right wing”. The best I could do would be the old-school small-l liberal vs. small-c conservative split, but that doesn’t apply anymore...

An LLM Flagged My Paper About LLMs Flagging Things.

Failfinder70 9 Jun 2026 18:00 UTC

5 points

0 comments · 2 min read · LW link

When Evaluation Fails

Failfinder70 9 Jun 2026 16:21 UTC

2 points

0 comments · 20 min read · LW link

Audit Report: Style over Substance at Scale.

Failfinder70 9 Jun 2026 16:16 UTC

2 points

0 comments · 3 min read · LW link

Contextual Identity Laundering: How Claude’s Image Refusal Can Be Routed Through Web Search

Failfinder70 8 Jun 2026 0:39 UTC

7 points

2 comments · 9 min read · LW link

Can Large Language Models Identify Novel Threats? Part 1: Mirror Life and the Classification Gap

Failfinder70 23 May 2026 0:15 UTC

8 points

0 comments · 3 min read · LW link