Nandi

Karma: 232

Intricacies of Feature Geometry in Large Language Models

7vik, Lucius Bushnaq and Nandi

7 Dec 2024 18:10 UTC

72 points

2 comments, 12 min read

The Geometry of Feelings and Nonsense in Large Language Models

27 Sep 2024 17:49 UTC

62 points

10 comments, 4 min read

[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs

Yohan Mathew, joanv, robert mccarthy, ollie, Nandi and Dylan Cope

25 Sep 2024 14:52 UTC

37 points

2 comments, 4 min read

(arxiv.org)

Robustness of Contrast-Consistent Search to Adversarial Prompting

Nandi, i, Jamie Wright, Seamus_F and hugofry

1 Nov 2023 12:46 UTC

18 points

1 comment, 7 min read

Machine Unlearning Evaluations as Interpretability Benchmarks

NickyP and Nandi

23 Oct 2023 16:33 UTC

33 points

2 comments, 11 min read

Splitting Debate up into Two Subsystems

Nandi

3 Jul 2020 20:11 UTC

13 points

5 comments, 4 min read

Acknowledging Human Preference Types to Support Value Learning

Nandi

13 Nov 2018 18:57 UTC

34 points

4 comments, 9 min read