Former Software Engineer at Microsoft who may focus on the alignment problem for the rest of his life (please bet on the prediction market here).
Sheikh Abdur Raheem Ali
I have mostly switched from using vast.ai/RunPod/Lambda Labs to Modal for my experiments.
Thank you for running this investigation. I’m confused about a few points.
Sorry, why is punctuation sensitivity an issue? Is this motivated by some concerning behavior you’ve observed?
Also, what exactly do you mean by “anomaly tiling”? I heard Bradley Rae use similar terminology in March; did this term come from an LLM? I’m also confused by “upstream carrier population”: it seems like you’re borrowing a concept from epidemiology or biosecurity to make an analogy without explicitly describing the process you’re referring to, but let me know if I’ve misunderstood this.
I don’t clearly understand this project’s threat model, I would be grateful if you could expand on that.
I’m not sure I agree with this framing. If a friend told me that they were struggling with improving the quality of their outputs, I don’t think that my first suggestion would be for them to put in more effort.
There are ML papers floating around with training methods and architectural tweaks (e.g. Block AttnRes or mHC-lite) that might be incorporated into future models.
It seems plausible to me that replacing standard residual skip connections with something more complicated (see the sketch after this list):
scales intelligence somewhat but not past the frontier
makes it slightly harder for existing interp flavored techniques to generate understanding
doesn’t meaningfully affect the relative performance of linear probes vs output classifiers for inference-time detection of precursors to high-risk misaligned behavior.
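To make concrete what “something more complicated” might look like, here’s a minimal PyTorch-style sketch contrasting a plain residual skip connection with a hypothetical learned-gate variant. The gate is my own stand-in for illustration, not the mechanism from Block AttnRes or mHC-lite:

```python
import torch
import torch.nn as nn


class StandardResidualBlock(nn.Module):
    """Plain skip connection: output = x + f(x)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        return x + self.f(x)


class GatedResidualBlock(nn.Module):
    """Hypothetical "more complicated" variant: a learned, input-dependent
    gate mixes the skip path and the transformed path, so the residual
    stream is no longer a simple sum of per-layer contributions."""

    def __init__(self, d_model: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))  # per-dimension gate in [0, 1]
        return g * x + (1 - g) * self.f(x)


x = torch.randn(2, 16, 512)  # (batch, seq, d_model)
print(StandardResidualBlock(512)(x).shape, GatedResidualBlock(512)(x).shape)
```

The only point of the toy gate is that the residual stream stops being a clean sum of per-layer contributions, which is the flavor of change that could make existing interp-style attribution marginally harder while leaving what a linear probe reads at a given layer essentially unchanged.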
I do think there are cases where models will be able to manipulate the data they’re feeding into white-box methods in a way that affects verdicts, but it’s hard to see these arising naturally before being demonstrated in more contrived scenarios, and I agree with evhub that this would be harder than circumventing black box safeguards.
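For anyone who wants the white-box vs black-box distinction spelled out: a white-box detector reads internal activations (e.g. a linear probe on the residual stream), while a black-box one only sees sampled text. A minimal sklearn sketch, with hypothetical file names and shapes:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical cached data: activations at one layer, the sampled output
# texts, and labels marking whether each example shows the precursor behavior.
acts = np.load("activations.npy")                # (n_examples, d_model)
texts = open("outputs.txt").read().splitlines()  # n_examples strings
labels = np.load("labels.npy")                   # (n_examples,)

# White-box: linear probe over internal activations.
probe = LogisticRegression(max_iter=1000).fit(acts, labels)

# Black-box: classifier over features of the output text only.
clf = LogisticRegression(max_iter=1000).fit(
    TfidfVectorizer().fit_transform(texts), labels
)
```

“Manipulating the data fed into the white-box method” here means the model steering its own activations so the probe’s verdict flips, which does seem harder to pull off than just phrasing outputs to slip past the text classifier.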
More legible reasoning traces might be more monitorable, but almost all of the Astra Fellowship empirical mentors seem to be doing CoT monitorability research now (based on a quick skim of their profiles), and I don’t currently believe this broad cluster of “CoT stuff” is all that important for x-risk. I personally think it’s actively harmful (ref. Don’t Align Agents to Evaluations of Plans), and even if it weren’t, the field would still be grossly over-investing in CoT controllability/monitorability/legibility/etc. for bad groupthink/hype/information-cascade-y reasons. Those reasons mainly track Coefficient Giving grantmakers collectively having an inertia-delayed reaction to “let’s think step by step” improving benchmark performance on math problems, which has led to the unchecked proliferation of “effort level/scratchpad length” configurations into the roadmap of every lab.
And what’s the point? Even findings that lead to ~unanimous consensus amongst this crowd saying “never train on CoT” are basically ignored, with CoT flagrantly leaking into RL training corpora at least 8% of the time.
I’m not concerned at all by a model being “able to make better use of some text for reasoning than we have capacity to monitor it”; I’m utterly confused that this is even a concern and find it bizarre that this appears to be the majority view. I could not pass this viewpoint’s ideological Turing test. What universe are we in? Who is writing these papers? Does the emperor have any clothes? Modern CoT monitoring is the equivalent of going through your teenage niece’s smartphone looking for suspicious messages and assigning detention after finding “after finals lets sue the school for all the 9 pm classes hehe” in the transcripts. Maybe it makes more sense if you are the HR department of a high moral maze institution doing “employee alignment” on AIs? Maybe they expect to live in the generous movie world where every would-be bad guy dramatically announces to the audience “Watch out, here I come!”, screaming the name of their special comic book villain move?
Token soup seems totally benign, on the level of linguistic drift; it’s not devoid of information, it just has its own structure that you can learn if you try.
The quality of writing on this post is extremely good.
Ah, discontentedness leading to randomness is a neat explanation for the constantly pivoting startup founder.
I also think that there could be a typo or math error in footnote 13, where it should be 1 - (1 − 1/n)^(n-1), not (1 − 1/n)^(n-1).
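For reference, my reading of the formula’s shape (not something stated in the footnote itself) is that it’s the probability that at least one of n − 1 independent trials, each succeeding with probability 1/n, succeeds. The complement rule then gives

$$P(\text{at least one success}) = 1 - P(\text{no successes}) = 1 - \left(1 - \tfrac{1}{n}\right)^{n-1} \xrightarrow{\;n \to \infty\;} 1 - \tfrac{1}{e} \approx 0.632,$$

so $(1 - 1/n)^{n-1}$ on its own would be the probability that none succeed.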
No, I expect these comments to be mostly written either by subscription users or by those paying public API prices. I’ve spent a significant amount of time with both products, and would recommend picking Codex with GPT 5.4 if I were limited to spending only $20/month, especially since there are regular rate limit resets. Claiming that these reviews are faked without providing strong evidence seems disingenuous to me; the harnesses really are not where they were in December.
I read some of this, but it’s a long post and it was late at night, so I didn’t have time to go through and understand all of it. It made me smile, and the tree metaphor was helpful for understanding the opaque thought processes of the few friends I have who don’t have ADHD.
Beautifully illustrated.
I am in awe; great rant. Do you have examples of models being trained in the latent space of another model?
I think that this should really be a top-level post.
Enjoyed the post. We need more astrophysics on LessWrong.
I applied! I spent exactly 90 minutes: ~85 minutes on the technical questions section and ~5 minutes on everything else.
Strong upvoted! I am interested in immunology, combat sports, and security research, so I found the neutrophil example to be very cool and compelling. Do you have takes on which aspects of real immune systems are most likely to transfer to their analogous defensive structures in LLM biology?
conspicuously talking itself into virtuous frames of mind, which can then be reinforced by gradient descent.
This is interesting. I think something like this is a driving factor for Claudius’s (Sonnet 3.7’s) underperformance on the original VendingBench (I haven’t taken a closer look at VendingBench 2).
The poor generalization of lookup tables is one of the primary reasons I am bearish on retrieval augmented generation.
Clearly written, and I think it points to something real. Thank you.
Do we have any baseline for human performance on Vending Bench Arena or Vending Bench 2?