Nandi

Karma: 232

Intricacies of Feature Geometry in Large Language Models

7vik, Lucius Bushnaq and Nandi

7 Dec 2024 18:10 UTC

72 points

2 comments · 12 min read · LW link

The Geometry of Feelings and Nonsense in Large Language Models

27 Sep 2024 17:49 UTC

62 points

10 comments · 4 min read · LW link

[Paper] Hidden in Plain Text: Emergence and Mitigation of Steganographic Collusion in LLMs

Yohan Mathew, joanv, robert mccarthy, ollie, Nandi and Dylan Cope

25 Sep 2024 14:52 UTC

37 points

2 comments · 4 min read · LW link

(arxiv.org)

Robustness of Contrast-Consistent Search to Adversarial Prompting

Nandi, i, Jamie Wright, Seamus_F and hugofry

1 Nov 2023 12:46 UTC

18 points

1 comment · 7 min read · LW link

Machine Unlearning Evaluations as Interpretability Benchmarks

NickyP and Nandi

23 Oct 2023 16:33 UTC

33 points

2 comments · 11 min read · LW link

Nandi 6 Jul 2020 22:21 UTC
1 point
0
in reply to: Charlie Steiner’s comment on: Splitting Debate up into Two Subsystems
I agree that if you score an oracle based on how accurate it is, then it is incentivized to steer the world towards states where easy questions get asked.
I think that in these considerations it matters how powerful we assume the agent to be. You made me realize that my post could have been more interesting if I had specified the scope and application area of the proposed approach more clearly. In many cases, making the world more predictable may be much harder for the agent than causing the human to predict the world better. In the short term, I think deploying an agentic oracle could be safe.

Nandi 4 Jul 2020 19:20 UTC
1 point
0
in reply to: lemonhope’s comment on: Splitting Debate up into Two Subsystems
I think Bostrom might have mentioned this problem (educating someone on a topic) somewhere.
Cool! I’m not familiar with it.

Nandi 4 Jul 2020 19:19 UTC
1 point
0
in reply to: lemonhope’s comment on: Splitting Debate up into Two Subsystems
If the epistemic helper can explain enough to us that we can come up with solutions ourselves, the info helper is just as useful by itself.

However, sometimes even if we get educated about a domain or problem, we may not be creative enough to propose good solutions ourselves. In such cases we would need an agent to propose options to us. It would be good if an agent that gets trained to come up with solutions that we approve of is not the same agent that explains to us why we should or should not approve of a solution (because if it were, it would have an incentive to convince us).

Splitting Debate up into Two Subsystems

Nandi 3 Jul 2020 20:11 UTC

13 points

5 comments · 4 min read · LW link

Acknowledging Human Preference Types to Support Value Learning

Nandi 13 Nov 2018 18:57 UTC

34 points

4 comments · 9 min read · LW link