Zoe Williams

Karma: 393

Hiatus: EA and LW post summaries

Zoe Williams 17 May 2023 17:17 UTC

14 points

0 comments · 1 min read · LW link

Summaries of top forum posts (1st to 7th May 2023)

Zoe Williams 9 May 2023 9:30 UTC

21 points

0 comments · 11 min read · LW link

Summaries of top forum posts (24th − 30th April 2023)

Zoe Williams 2 May 2023 2:30 UTC

12 points

1 comment · 10 min read · LW link

Summaries of top forum posts (17th − 23rd April 2023)

Zoe Williams 24 Apr 2023 4:13 UTC

18 points

0 comments · 8 min read · LW link

Summaries of top forum posts (27th March to 16th April)

Zoe Williams 17 Apr 2023 0:28 UTC

14 points

1 comment · 12 min read · LW link

EA & LW Forum Weekly Summary (20th − 26th March 2023)

Zoe Williams 27 Mar 2023 20:46 UTC

4 points

0 comments · 6 min read · LW link

EA & LW Forum Weekly Summary (13th − 19th March 2023)

Zoe Williams 20 Mar 2023 4:18 UTC

13 points

0 comments · 14 min read · LW link

AI Safety − 7 months of discussion in 17 minutes

Zoe Williams 15 Mar 2023 23:41 UTC

25 points

0 comments · 17 min read · LW link

EA & LW Forum Weekly Summary (6th − 12th March 2023)

Zoe Williams 14 Mar 2023 3:01 UTC

7 points

0 comments · 12 min read · LW link

EA & LW Forum Weekly Summary (27th Feb − 5th Mar 2023)

Zoe Williams 6 Mar 2023 3:18 UTC

12 points

0 comments · 11 min read · LW link

EA & LW Forum Weekly Summary (20th − 26th Feb 2023)

Zoe Williams 27 Feb 2023 3:46 UTC

4 points

0 comments · 14 min read · LW link

EA & LW Forum Weekly Summary (6th − 19th Feb 2023)

Zoe Williams 21 Feb 2023 0:26 UTC

8 points

0 comments · 14 min read · LW link

EA & LW Forum Weekly Summary (30th Jan − 5th Feb 2023)

Zoe Williams 7 Feb 2023 2:13 UTC

3 points

3 comments · 14 min read · LW link

Zoe Williams 1 Feb 2023 22:06 UTC
1 point
−7
on: Inner Misalignment in “Simulator” LLMs
Post summary (feel free to suggest edits!):
The author argues that the “simulators” framing for LLMs shouldn’t reassure us much about alignment. Scott Alexander has previously suggested that LLMs can be thought of as simulating various characters, e.g. the “helpful assistant” character. The author broadly agrees, but notes that this solves neither outer alignment (‘be careful what you wish for’) nor inner alignment (‘you wished for it right, but the program you got had ulterior motives’).
They give an example of each failure case:
For outer alignment, say researchers want a chatbot that gives helpful, honest answers, but end up with a sycophant that tells the user what they want to hear. For inner alignment, imagine a prompt engineer asking the chatbot to explain how to solve the Einstein-Durkheim-Mendel conjecture as if it were ‘Joe’, who’s awesome at quantum sociobotany. But the AI thinks the ‘Joe’ character secretly cares about paperclips, so it gives an answer that will help create a paperclip factory instead.

(This will appear in this week’s forum summary. If you’d like to see more summaries of top EA and LW forum posts, check out the Weekly Summaries series.)
What links here?
- Adam Scherlis's comment on Inner Misalignment in “Simulator” LLMs by Adam Scherlis (2 Feb 2023 0:33 UTC; 1 point)

EA & LW Forum Weekly Summary (23rd − 29th Jan ’23)

Zoe Williams 31 Jan 2023 0:36 UTC

12 points

0 comments · 13 min read · LW link

EA & LW Forum Weekly Summary (16th − 22nd Jan ’23)

Zoe Williams 23 Jan 2023 3:46 UTC

13 points

0 comments · 9 min read · LW link

EA & LW Forum Summaries (9th Jan to 15th Jan 23′)

Zoe Williams 18 Jan 2023 7:29 UTC

17 points

0 comments · 13 min read · LW link

EA & LW Forum Summaries—Holiday Edition (19th Dec − 8th Jan)

Zoe Williams 9 Jan 2023 21:06 UTC

11 points

0 comments · 8 min read · LW link

EA & LW Forums Weekly Summary (12th Dec − 18th Dec 22′)

Zoe Williams 20 Dec 2022 9:49 UTC

10 points

0 comments · 17 min read · LW link

EA & LW Forums Weekly Summary (5th Dec − 11th Dec 22′)

Zoe Williams 13 Dec 2022 2:53 UTC

7 points

0 comments · 18 min read · LW link