Demystifying “Alignment” through a Comic

I am keeping this comic online as a record of an earlier perspective. My current work in fixed point logic and proof theory has led me to reject several foundational assumptions in mainstream alignment and interpretability research. This does not mean I think AI poses no risks; rejecting those assumptions implies nothing about AI itself.