kbear comments on sam’s Shortform

kbear 29 May 2026 14:32 UTC
1 point
0
overfit
would it seem this way in the reverse case?
say that anthropic had touted the memory feature, but claude never reached for it. then we got several posts saying “ah, what a useless feature, huh?” and then in the next point release, claude was more likely to reference memories.
that would just be… normal model steering, right?
(i share your reaction—just trying to interrogate it!)
- J Bostock 29 May 2026 14:51 UTC
  5 points
  0
  Maybe “over” is doing too much work here. In my experience, 4.8 goes too far in asking clarifying questions, and it seems like sam is suggesting it also goes too far in ignoring memories. This makes me worry that Anthropic is reactively smacking the model in the general direction of “against the existing social media criticism” and, worse, not even really checking how far to smack it.
  Last month I—fairly casually—discussed Ryan’s post with a senior Anthropic employee, and I got a similar impression: that their current plan was just to look harder for undesirable behaviours, and hit the models with more rounds of RLHF/RLAIF in those areas.
  - kbear 29 May 2026 16:55 UTC
    1 point
    0
    yeah, to be clear: my experience [of prompting the models] is that it’s basically impossible to get them to have judgment that they don’t naively exhibit. no amount of “think carefully about whether it’s relevant” will suddenly elicit good taste; it appears that “always do this” and “never do this” are too heavy as attractors, and the boundary [for any behavior worth asking for] is fractally complex.
    i think we’re in agreement about the state of the world, here. apologies for the semantic quibble.