Jozdien

Karma: 2,405

Realistic Reward Hacking Induces Different and Deeper Misalignment

Jozdien 9 Oct 2025 18:45 UTC
111 points
2 comments, 23 min read, LW link

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

8 Oct 2025 22:02 UTC
144 points
30 comments, 2 min read, LW link

Why Do Some Language Models Fake Alignment While Others Don’t?

8 Jul 2025 21:49 UTC
158 points
14 comments, 5 min read, LW link
(arxiv.org)

Lighthaven Sequences Reading Group #36 (Tuesday 5/27)

26 May 2025 23:52 UTC
8 points
0 comments, 1 min read, LW link

Lighthaven Sequences Reading Group #35 (Tuesday 5/20)

19 May 2025 20:58 UTC
8 points
0 comments, 1 min read, LW link

Lighthaven Sequences Reading Group #34 (Tuesday 5/13)

10 May 2025 7:42 UTC
8 points
0 comments, 1 min read, LW link

Lighthaven Sequences Reading Group #33 (Tuesday 5/6)

30 Apr 2025 3:39 UTC
8 points
0 comments, 1 min read, LW link

[LAPTOP REQUIRED] Lighthaven Sequences Reading Group #32 (Tuesday 04/29)

26 Apr 2025 3:53 UTC
12 points
0 comments, 2 min read, LW link

Lighthaven Sequences Reading Group #31 (Tuesday 04/22)

16 Apr 2025 4:46 UTC
7 points
0 comments, 1 min read, LW link

Lighthaven Sequences Reading Group #30 (Tuesday 04/15)

14 Apr 2025 1:18 UTC
8 points
0 comments, 2 min read, LW link

Lighthaven Sequences Reading Group #29 (Tuesday 04/08)

4 Apr 2025 1:16 UTC
9 points
0 comments, 2 min read, LW link

Introducing BenchBench: An Industry Standard Benchmark for AI Strength

Jozdien 2 Apr 2025 2:11 UTC
49 points
0 comments, 2 min read, LW link

Lighthaven Sequences Reading Group #28 (Tuesday 04/01)

26 Mar 2025 2:43 UTC
12 points
0 comments, 1 min read, LW link

Lighthaven Sequences Reading Group #27 (Tuesday 03/25)

20 Mar 2025 4:34 UTC
14 points
0 comments, 2 min read, LW link

Lighthaven Sequences Reading Group #26 (Tuesday 03/18)

17 Mar 2025 20:35 UTC
7 points
0 comments, 2 min read, LW link

Lighthaven Sequences Reading Group #25 (Tuesday 03/11)

10 Mar 2025 2:06 UTC
7 points
7 comments, 1 min read, LW link

Lighthaven Sequences Reading Group #24 (Tuesday 03/04)

3 Mar 2025 19:13 UTC
6 points
0 comments, 1 min read, LW link

Lighthaven Sequences Reading Group #23 (Tuesday 02/25)

23 Feb 2025 5:01 UTC
8 points
0 comments, 1 min read, LW link

Lighthaven Sequences Reading Group #22 (Tuesday 02/18)

16 Feb 2025 3:51 UTC
7 points
1 comment, 1 min read, LW link

Lighthaven Sequences Reading Group #21 (Tuesday 02/11)

6 Feb 2025 20:49 UTC
8 points
0 comments, 1 min read, LW link