Born too late to explore Earth; born too early to explore the galaxy; born just the right time to save humanity.
Ulisse Mini
Finally Entering Alignment
TinyStories: Small Language Models That Still Speak Coherent English
[Question] What rationality failure modes are there?
How to get good at programming
Don’t be afraid of the thousand-year-old vampire
[ASoT] Natural abstractions and AlphaZero
Three Fables of Magical Girls and Longtermism
[Question] Finding Great Tutors
[Question] What ML gears do you like?
[ASoT] GPT2 Steering & The Tuned Lens
Downvoted because I view some of the suggested strategies as counterproductive. Specifically, I’m afraid of people flailing. I’d be much more comfortable if there was a bolded paragraph saying something like the following:
Beware of flailing and second-order effects and the unilateralist’s curse. It is very easy to end up doing harm with the intention to do good, e.g. by sharing bad arguments for alignment, polarizing the issue, etc.
To give specific examples illustrating this (which may also be good to include in and/or edit into the post):
I believe tweets like this are much better (and net positive) than the tweet you give as an example. Sharing anything less than the strongest argument can be actively bad to the extent it immunizes people against the actually good reasons to be concerned.
Most forms of civil disobedience seem actively harmful to me. Activating the tribal instincts of more mainstream ML researchers, causing them to hate the alignment community, would be pretty bad in my opinion. Protesting in the streets seems fine; protesting by OpenAI HQ does not.
Don’t have time to write more. For more info see this twitter exchange I had with the author. I could share more thoughts and models, but my main point is: be careful. Taking action is fine, and don’t fall into the analysis-paralysis of some rationalists, but don’t make everything worse.
Character.ai seems to have a lot more personality than ChatGPT. I feel bad for not thanking you earlier (as I was in disbelief), but everything here is valuable safety information. Thank you for sharing, despite potential embarrassment :)
I think you should have stated the point more forcefully. It’s insane that we don’t have alignment prediction markets (with high liquidity and real money) given:
- How adjacent rationality is to forecasting and how many superforecasters are in the community
- The number of people-who-want-to-help whose comparative advantage isn’t technical alignment research
There should be a group of expert forecasters who make a (possibly subsidized) living on alignment prediction markets! Alignment researchers should routinely bet thousands on relevant questions!
There’s a huge amount of low-hanging dignity here; I can vividly imagine the cringing in dath ilan right now!
Was considering saving this for a followup post but it’s relatively self-contained, so here we go.
Why are huge coefficients sometimes okay? Let’s start by looking at norms per position after injecting a large vector at position 20.
This graph is explained by LayerNorm. Before using the residual stream we perform a LayerNorm:
```python
# transformer block forward() in GPT2
x = x + self.attn(self.ln_1(x))
x = x + self.mlp(self.ln_2(x))
```
If `x` has a very large magnitude, then the block doesn’t change it much relative to its magnitude. Additionally, attention is run on the normalized `x`, meaning only the “unscaled” version of `x` is moved between positions.
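To make that concrete, here’s a minimal PyTorch sketch (toy dimensions, with a plain Linear layer standing in for the attention/MLP sublayers rather than actual GPT-2 weights) showing that when `x` has a huge norm, the update the block adds, which is computed from the normalized `x`, is negligible by comparison:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 768

# Residual stream after injecting a steering vector with a huge coefficient.
x = 1000 * torch.randn(d_model)

ln = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # stand-in for attn / mlp

# The block's update is computed from ln(x), so its norm stays O(1)
# regardless of how large x is.
update = sublayer(ln(x))
x_new = x + update

print(x.norm(), update.norm())                   # update is tiny relative to x
print(torch.cosine_similarity(x, x_new, dim=0))  # ~1.0: direction barely changes
```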
As expected, we see a convergence in probability along each token position when we look with the tuned lens. You can see how for positions 1 & 2 the output distribution is decided at layer 20: since we overwrote the residual stream with a huge coefficient, all the LayerNorm’d outputs we’re adding are tiny in comparison, and in the final LayerNorm we get `ln(bigcoeff*diff + small) ~= ln(bigcoeff*diff) ~= ln(diff)`.
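As a sanity check on that approximation, here’s another toy sketch (arbitrary random vectors, not the real residual stream): since LayerNorm is invariant to positive rescaling of its input, the small contributions added by later layers get washed out by the huge injected direction.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 768

diff = torch.randn(d_model)    # the injected steering direction
small = torch.randn(d_model)   # everything the later layers add on top
ln = nn.LayerNorm(d_model, elementwise_affine=False)

# ln(bigcoeff*diff + small) ~= ln(bigcoeff*diff) = ln(diff),
# because LayerNorm normalizes away the overall scale.
a = ln(1000 * diff + small)
b = ln(diff)
print((a - b).norm() / b.norm())  # ~1e-3: the output is dominated by diff
```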
LIMA: Less Is More for Alignment
If the title is meant to be a summary of the post, I think that would be analogous to someone saying “nuclear forces provide an untapped wealth of energy”. It’s true, but the reason the energy is untapped is because nobody has come up with a good way of tapping into it.
The difference is that people have been trying hard to harness nuclear forces for energy, while people have not been trying hard to study humans for alignment in the same way. Even accounting for the alignment field being far smaller, there hasn’t been a real effort as far as I can see. Most people immediately respond with “AGI is different from humans for X, Y, Z reasons” (which are true) and then proceed to throw out the baby with the bathwater by not looking into human value formation at all.
Planes don’t fly like birds, but we sure as hell studied birds to make them.
If you come up with a strategy for how to do this, then I’m much more interested. That’s a big reason why I’m asking for a summary: I think you might have tried to express something like this in the post and I’m missing it.
This is their current research direction, The shard theory of human values, which they’re currently writing posts on.
Characterizing Intrinsic Compositionality in Transformers with Tree Projections
I’ll note (because some commenters seem to miss this) that Eliezer is writing in a convincing style for a non-technical audience. Obviously the debates he would have with technical AI safety people are different from what is most useful to say to the general population.
Bit late, but running the same experiment with 1000 dimensions instead of 16, and 10k steps instead of 5k gives
Which appears to be on the way to a minimum, though I’m unsure if I should tweak hparams when scaling up this much. Trying other optimizers would be interesting too, but I think I’ve gotten nerd-sniped by this too much already… Code is here.
- My Objections to “We’re All Gonna Die with Eliezer Yudkowsky” (21 Mar 2023 0:06 UTC; 355 points)
- My Objections to “We’re All Gonna Die with Eliezer Yudkowsky” (EA Forum; 21 Mar 2023 1:23 UTC; 167 points)
- Arguments for optimism on AI Alignment (I don’t endorse this version, will reupload a new version soon.) (15 Oct 2023 14:51 UTC; 23 points)
Can you give specific examples/screenshots of prompts and outputs? I know you said reading the chat logs wouldn’t be the same as experiencing it in real time, but some specific claims, like the quoted prompt resulting in a conversation like that, are highly implausible.[1] At a minimum you’d need to do some prompt engineering, and even with that, some of this is implausible with ChatGPT, which typically acts very unnaturally after all the RLHF OpenAI did.
Source: I tried it, and tried some basic prompt engineering, and it still resulted in bad outputs.