shash42

Karma: 129

New Paper: It is time to move on from MCQs for LLM Evaluations

shash42 6 Jul 2025 11:48 UTC
9 points
0 comments · 2 min read · LW link

An Alternative Way to Forecast AGI: Counting Down Capabilities

shash42 29 Jun 2025 19:52 UTC
3 points
0 comments · 3 min read · LW link
(open.substack.com)

Incorrect Baseline Evaluations Call into Question Recent LLM-RL Claims

shash42 29 May 2025 18:40 UTC
65 points
7 comments · 1 min read · LW link
(safe-lip-9a8.notion.site)

Log-linear Scaling is Worth the Cost due to Gains in Long-Horizon Tasks

shash42 7 Apr 2025 21:50 UTC
16 points
2 comments · 1 min read · LW link

shash42's Shortform

shash42 15 Dec 2024 18:49 UTC
2 points
0 comments · LW link

Evaluating hidden directions on the utility dataset: classification, steering and removal

25 Sep 2023 17:19 UTC
25 points
3 comments · 7 min read · LW link