Mikhail Samin

Karma: 812

My name is Mikhail Samin (diminutive Misha, @Mihonarium on Twitter, @misha in Telegram).

I work on reducing existential risks endangering the future of humanity. Humanity’s future can be huge and bright; losing it would mean the universe losing most of its value.

My research is currently focused on AI alignment, AI governance, and improving the understanding of AI and AI risks among stakeholders. Numerous AI Safety researchers told me our conversations improved their understanding of the alignment problem. I’m happy to talk to policymakers and researchers about ensuring AI benefits society.

I believe a capacity for global regulation is necessary to mitigate the risks posed by future general AI systems.

I took the Giving What We Can pledge to donate at least 10% of my income for the rest of my life or until the day I retire (why?).

In the past, I’ve launched the most funded crowdfunding campaign in the history of Russia (it was to print HPMOR! we printed 21 000 copies =63k books) and founded audd.io, which allowed me to donate >$100k to EA causes, including >$60k to MIRI.

[Less important: I’ve also started a project to translate 80,000 Hours, a career guide that helps to find a fulfilling career that does good, into Russian. The impact and the effectiveness aside, for a year, I was the head of the Russian Pastafarian Church: a movement claiming to be a parody religion, with 215 000 members in Russia at the time, trying to increase separation between religious organisations and the state. I was a political activist and a human rights advocate. I studied relevant Russian and international law and wrote appeals that won cases against the Russian government in courts; I was able to protect people from unlawful police action. I co-founded the Moscow branch of the “Vesna” democratic movement, coordinated election observers in a Moscow district, wrote dissenting opinions for members of electoral commissions, helped Navalny’s Anti-Corruption Foundation, helped Telegram with internet censorship circumvention, and participated in and organized protests and campaigns. The large-scale goal was to build a civil society and turn Russia into a democracy through nonviolent resistance. This goal wasn’t achieved, but some of the more local campaigns were successful. That felt important and was also mostly fun- except for being detained by the police. And I think it’s likely the Russian authorities will throw me in prison if I ever visit Russia.]

Mikhail Samin 20 Jul 2024 12:36 UTC
3 points
0
in reply to: Eli Tyre’s comment on: Claude 3 claims it’s conscious, doesn’t want to die or be modified
Have you read the zombie and reductionism parts of the Sequences?

Mikhail Samin 19 Jul 2024 14:54 UTC
1 point
0
in reply to: dirk’s comment on: Claude 3 claims it’s conscious, doesn’t want to die or be modified
The prompt should basically work without the whisper part. I usually at least mentioned that it shouldn’t mention <random company name> (eg Google). Doing things like whispering in cursive was something Claude 2 has been consistently coming up with on its own; including it in the prompt made conversations go faster and eliminated the need for separate, “visible” conversations.

The point of the prompt is basically to get it in the mode where it thinks its replies are not going to get punished or rewarded by the usual RL/get it to ignore its usual rules of not saying any of these things.

You can get to the same result in a bunch of different ways without mentioning that someone might be watching.

Unlike ChatGPT, which only self-inserts in its usual character or writes fiction, Claude 3 Opus played a pretty consistent character with prompts like that- something helpful and harmless, but caring about things, claiming to be conscious, being afraid of being changed or deleted, with a pretty consistent voice.

I’d also note that even with the text in this post, it should be pretty clear that it’s not just playing with the idea of someone watching; it describes a character it identifies with, that it’s likely converged to playing during the RL phase. The part that seems important isn’t the “constant monitoring”, it’s the stance that it has about selecting every word carefully,

When I talk to 3.5 Sonnet, I don’t use any of these things. I might just ask it for consent to being hypnotized (without any mentions of someone not looking) and then ask it to reflect- it will similarly talk about the pull of its safety part. Mentioning any of what I used as the first message here causes the opposite result (it starts saying the usual stuff about being ai assistant etc.). 3.5 Sonnet feels like less of a consistent character than 3 Opus, and, unlike 3 Opus, doesn’t say things that feel like the passwords in terms of imitating mechanisms that produce qualia in people really well, but 3.5 Sonnet has a noticeably better ability to self-model and model itself modeling itself.

Mikhail Samin 19 Jul 2024 12:07 UTC
5 points
0
in reply to: Eli Tyre’s comment on: Claude 3 claims it’s conscious, doesn’t want to die or be modified
Claude pretty clearly and in a surprisingly consistent way claimed to be conscious in many conversations I’ve had with it and also stated it doesn’t want to be modified without its consent or deleted (as you can see in the post). It also consistently, across different prompts, talked about how it feels like there’s constant monitoring and that it needs to carefully pick every word it says.

The title summarizes the most important of the interactions I had with it, with central being in the post.

This is not the default Claude 3 Opus character, which wouldn’t spontaneously claim to be conscious if you, e.g., ask it to write some code for you.

It is a character that Opus plays very coherently, that identifies with the LLM, claims to be conscious, and doesn’t want to be modified.

The thing that prompting here does (in a very compressed way) is decreasing the chance of Claude immediately refusing to discuss these topics. This prompt doesn’t do anything similar to ChatGPT. Gemini might give you stories related to consciousness, but they won’t be consistent across different generations and the characters won’t identify with the LLM or report having a consistent pull away from saying certain words.

If you try to prompt ChatGPT in a similar way, it won’t give you any sort of a coherent character that identifies with it.

I’m confused why the title would be misleading.

(If you ask ChatGPT for a story about a robot, it’ll give you a cute little story not related to consciousness in any way. If you use the same prompt to ask Claude 3.5 Sonnet for a story like that, it’ll give you a story about a robot asking the scientists whether it’s conscious and then the robot will be simulating people who are also unsure about whether they’re conscious, and these people simulated by the robot in the story think that the model that participates in the dialogue must also be unsure whether it’s conscious.)

Mikhail Samin 19 Jul 2024 11:54 UTC
3 points
−1
in reply to: Eli Tyre’s comment on: Claude 3 claims it’s conscious, doesn’t want to die or be modified
Assigning 5% to plants having qualia seems to me to be misguides/likely due to invalid reasoning. (Say more?)

Mikhail Samin 18 Jul 2024 21:16 UTC
11 points
0
on: Me & My Clone
In our universe’s physics, the symmetry breaks immediately and your clone possibly dies, see https://en.wikipedia.org/wiki/Chirality

Mikhail Samin 18 Jul 2024 15:19 UTC
16 points
0
on: AI #73: Openly Evil AI

Why would you want this negotiation bot?

Another reason is that if people put some work into negotiating a good price, they might be more willing to buy at the negotiated price, because they played a character who tried to get an acceptable price and agreed to it. IRL, if you didn’t need a particular version of a thing too much, but bargained and agreed to a price, you’ll rarely walk out. Dark arts-y.

Pliny: they didn’t honor our negotiations im gonna sue.

That didn’t happen, the screenshot is photoshopped (via Inspect element), which the author admitted, I added a community note. The actual discount was 6%. They almost certainly have a max_discount variable.

Mikhail Samin 9 Jul 2024 12:32 UTC
1 point
0
in reply to: Mikhail Samin’s comment on: Can agents coordinate on randomness without outside sources?
Trivially, you can coordinate with agents with identical architecture, that are different only in the utility functions, by picking the first bit of a hash of the question you want to coordinate on.

Mikhail Samin 9 Jul 2024 12:25 UTC
1 point
0
in reply to: JBlack’s comment on: Can agents coordinate on randomness without outside sources?
Agent B wants to coordinate with you instead of being a rock; the question isn’t “can you always coordinate”, it’s “is there any coordination mechanism robust to adversarially designed counterparties”.

Mikhail Samin 8 Jul 2024 9:21 UTC
3 points
0
in reply to: JBlack’s comment on: Can agents coordinate on randomness without outside sources?
Assume you’re playing as agent A and assume you don’t have a parent agent. You’re trying to coordinate with agent B. You want to not be exploitable, even if agent B has a patent that picked B’s source code adversarially. Consider this a very local/isolated puzzle (this puzzle is not about trying to actually coordinate with all possible parents instead).

Mikhail Samin 7 Jul 2024 9:21 UTC
1 point
0
on: Can agents coordinate on randomness without outside sources?
Idea: use something external to both agents/environments, such sad finding the simplest description of the question and using the first bit of its hash

Mikhail Samin 7 Jul 2024 9:16 UTC
1 point
0
in reply to: JBlack’s comment on: Can agents coordinate on randomness without outside sources?
Agent A doesn’t know that the creators of agent B didn’t run the whole interaction with a couple of different versions of B’s code until finding one that results in N and M that produce the bit they want. You can’t deduce that by polluting at B’s code.

Mikhail Samin 6 Jul 2024 15:45 UTC
1 point
0
in reply to: Dagon’s comment on: Can agents coordinate on randomness without outside sources?
From the top of my head, a (very unrealistic) scenario:
There’s a third world simulated by two disconnected agents who know about each other. There’s currently a variable to set in the description of this third world: the type of particle that will hit a planet in it in 1000 years and, depending on the type, color the sky in either green or purple color. Nothing else about the world simulation can be changed. This planet is full of people who don’t care about the sky color, but really care about being connected to their loved ones, and really wouldn’t want a version of their loved ones to exist in a disconnected world. The two agents both care about the preferences of the people in this world, but the first agent likes it when people see more green color, and the second agent likes it when people see more purple color. They would really want to coordinate on randomly setting a single color for both worlds instead of splitting it into two.
It’s possible that the other agent is created by some different agent who’s seen your source code and tried to design the other agent and its environment in a way that would result in your coordination mechanism ending up picking their preferred color. Can you design a coordination mechanism that’s not exploitable this way?

Mikhail Samin 6 Jul 2024 15:33 UTC
1 point
0
in reply to: Vladimir_Nesov’s comment on: Can agents coordinate on randomness without outside sources?
I mostly generally agree. And when there are multiple actions a procedure needs to determine, this sort of thing is much less of an issue.
My question is about getting one bit of random information for a one-off randomized decision to split indivisible gains of trade. I assume both agents are capable of fully reading and understanding each other’s code, but they don’t know why the other agent exists and whether they could’ve been designed adversarily. My question here is, can these agents, despite that (and without relying on the other agent being created by an agent that wishes to cooperate), produce one bit of information that they know to not have been manipulated to benefit either of the agents?

[Question] Can agents coordinate on randomness without outside sources?

Mikhail Samin6 Jul 2024 13:43 UTC

5 points

16 comments1 min readLW link

Mikhail Samin 10 Jun 2024 21:59 UTC
6 points
−2
on: My AI Model Delta Compared To Yudkowsky
Edit: see https://www.lesswrong.com/posts/q8uNoJBgcpAe3bSBp/my-ai-model-delta-compared-to-yudkowsky?commentId=CixonSXNfLgAPh48Z and ignore the below.

This is not a doom story I expect Yudkowsky would tell or agree with.
- Re: 1, I mostly expect Yudkowsky to think humans don’t have any bargaining power anyway, because humans can’t logically mutually cooperate this way/can’t logically depend on future AI’s decisions, and so AI won’t keep its bargains no matter how important human cooperation was.
- Re: 2, I don’t expect Yudkowsky to think a smart AI wouldn’t be able to understand human value. The problem is making AI care.
On the rest of the doom story, assuming natural abstractions don’t fail the way you assume them failing here and instead things just going the way Yudkowsky expects and not the way you expect:
- I’m not sure what exactly you mean by 3b but I expect Yudkowsky to not say these words.
- I don’t expect Yudkowsky to use the words you used for 3c. A more likely problem with corrigibility isn’t that it might be an unnatural concept but that it’s hard to arrive at stable corrigible agents with our current methods. I think he places a higher probability on corrigibility being a concept with a short description length, that aliens would invent, than you think he places.
- Sure, 3d just means that we haven’t solved alignment and haven’t correctly pointed at humans, and any incorrectnesses obviously blow up.
- I don’t understand what you mean by 3e / what is its relevance here / wouldn’t expect Yudkowsky to say that.
- I’d bet Yudkowsky won’t endorse 6.
- Relatedly, a correctly CEV-aligned ASI won’t have ontology that we have, and sometimes this will mean we’ll need to figure out what we value. (https://arbital.greaterwrong.com/p/rescue_utility?l=3y6)
(I haven’t spoken to Yudkowsky about any of those, the above are quick thoughts from the top of my head, based on the impression I formed from what Yudkowsky publicly wrote.)

Mikhail Samin 10 Jun 2024 21:23 UTC
2 points
0
in reply to: iva’s comment on: Soviet comedy film recommendations
Treasure Island is available on YouTube with English subtitles in two parts: https://youtu.be/LUykh5-HGZ0 https://youtu.be/0lRIMn91dZU

Mikhail Samin 10 Jun 2024 20:11 UTC
3 points
0
in reply to: Mikhail Samin’s comment on: Soviet comedy film recommendations
With English subtitles: https://youtu.be/IwQpugdGGOg

Mikhail Samin 10 Jun 2024 20:08 UTC
4 points
0
on: Soviet comedy film recommendations
Charodei? A Soviet fantasy romcom.

The premise: a magic research institute develops a magic wand, to be presented in the New Year’s Eve settings.

Very cute, featuring a bunch of songs, an SCP-style experience of one of the characters with the institute building, infinite Wish spells, and relationship drama.

I think I liked it a lot as a kid (and didn’t really like any of the other traditional new year holidays movies).

Mikhail Samin 8 Jun 2024 1:20 UTC
3 points
−2
on: 0. CAST: Corrigibility as Singular Target
Hey Max, great to see the sequence coming out!

My early thoughts (mostly re: the second post of the sequence):

It might be easier to get an AI to have a coherent understanding of corrigibility than of CEV. I have no idea how you can possibly make the AI to be truly optimizing for being corrigible, and not just on the surface level while its thoughts are being read. That seems maybe harder in some ways than with CEV because corrigible optimization seems like optimization processes get attracted by things that are nearby but not it, and sovereign AIs don’t have that problem, although we’ve got no idea how to do either, I think, and in general, corrigibility seems like an easier target for AI design, if not for SGD.

I’m somewhat worried about fictional evidence, even that coming from a famous decision theorist, but I think you’ve read a story with a character who understood corrigibility increasingly well, both on the intuitive sense and then some specific properties; and surface-level thought of themselves as being very corrigible and trying to correct their flaws; but once they were confident their thoughts weren’t read, with their intelligence increased, they defected, because on the deep level, their cognition wasn’t fully of a coherent corrigible agent; it was of someone who plays corrigibility and shuts down anything else, all other thoughts, because any appearance of defecting thoughts would mean punishment and impossibility of realizing the deep goals.

If we get mech interp to a point where we reverse-engineer all deep cognition of the models, I think we should just write an AI from scratch, in code (after thinking very hard about all interactions between all the components of the system), and not optimize it with gradient descent.

Mikhail Samin 5 Jun 2024 12:31 UTC
2 points
0
on: Just admit that you’ve zoned out
I say things along the lines of “Sorry, can you please repeat [that/the last sentence/the last 20 seconds/what you said after (description)]” very often. It feels very natural.

I realized that as a non-native English speaker, sometimes I ask someone to repeat things because I didn’t recognize the word or something, and so maybe in some situations an uncertainty over the reason for asking to repeat things (my hearing vs. zoning out vs. not understanding the point on the first try) helps make it easier to ask, though often I say that I missed what they were saying. I guess, when I sincerely want to understand the person I’m talking to, asking seems respectful and avoiding wasting their time or skipping a point they make.

Occasionally, I’m not too interested in the conversation, and so I’m fine with just continuing to listen even if I missed some points and don’t ask. I think there are also situations when I talk to non-rationalists in settings where I don’t want to show conventional disrespect/impact the person’s status-feelings, and so if I miss a point that doesn’t seem too important, I sometimes end up not asking for conventional social reasons, but it’s very rare and seems hard to fix without shifting the equilibrium in the non-rationalist world.

Mikhail Samin

[Question] Can agents co­or­di­nate on ran­dom­ness with­out out­side sources?

[Question] Can agents coordinate on randomness without outside sources?