Jozdien (Arun Jose)

Karma: 1,212

The case for more ambitious language model evals

Jozdien 30 Jan 2024 0:01 UTC

105 points

25 comments · 5 min read · LW link

AI Safety via Luck

Jozdien 1 Apr 2023 20:13 UTC

76 points

7 comments · 11 min read · LW link

Thoughts On (Solving) Deep Deception

Jozdien 21 Oct 2023 22:40 UTC

66 points

2 comments · 6 min read · LW link

Conditioning Generative Models for Alignment

Jozdien 18 Jul 2022 7:11 UTC

58 points

8 comments · 20 min read · LW link

Gradient Filtering

Jozdien and janus

18 Jan 2023 20:09 UTC

54 points

16 comments · 13 min read · LW link

Trying to isolate objectives: approaches toward high-level interpretability

Jozdien 9 Jan 2023 18:33 UTC

48 points

14 comments · 8 min read · LW link

Critiques of the AI control agenda

Jozdien 14 Feb 2024 19:25 UTC

47 points

14 comments · 9 min read · LW link

[ASoT] Finetuning, RL, and GPT’s world prior

Jozdien 2 Dec 2022 16:33 UTC

44 points

8 comments · 5 min read · LW link

Gradient Descent on the Human Brain

Jozdien and gaspode

1 Apr 2024 22:39 UTC

42 points

4 comments · 2 min read · LW link

The Pointer Resolution Problem

Jozdien 16 Feb 2024 21:25 UTC

41 points

20 comments · 3 min read · LW link

[ASoT] Simulators show us behavioural properties by default

Jozdien 13 Jan 2023 18:42 UTC

33 points

2 comments · 3 min read · LW link

Difficulty classes for alignment properties

Jozdien 20 Feb 2024 9:08 UTC

32 points

5 comments · 2 min read · LW link

Insufficient Values

Jozdien, Jacob Abraham and Abraham Francis

16 Jun 2021 14:33 UTC

31 points

15 comments · 5 min read · LW link

Utopic Nightmares

Jozdien 14 May 2021 21:24 UTC

10 points

20 comments · 5 min read · LW link

Gaming Incentives

Jozdien 29 Jul 2021 13:51 UTC

10 points

4 comments · 6 min read · LW link