We’re in the north-west corner of Stoup
Yesterday I was at a “cultivating curiosity” workshop beta-test. One concept was “there are different mental postures you can adopt, which affect how easy it is to notice and cultivate curiosities.”
It wasn’t exactly the point of the workshop, but I ended up with several different “curiosity-postures” that were useful to try on while leaning into “curiosity” re: topics that I feel annoyed, frustrated, or demoralized about.
The default stances I end up with when I Try To Do Curiosity On Purpose are something like:
1. Dutiful Curiosity (which is kinda fake, although capable of being dissociatedly autistic and noticing lots of details that exist and questions I could ask)
2. Performatively Friendly Curiosity (also kinda fake, but does shake me out of my default way of relating to things. In this, I imagine saying “hullo!” to whatever thing I’m bored/frustrated with, and try to acknowledge it and give it at least some chance of telling me things)
But some other stances to try on, that came up, were:
3. Curiosity like “a predator.” “I wonder what that mouse is gonna do?”
4. Earnestly playful curiosity. “oh that [frustrating thing] is so neat, I wonder how it works! what’s it gonna do next?”
5. Curiosity like “a lover”. “What’s it like to be that you? What do you want? How can I help us grow together?”
6. Curiosity like “a mother” or “father” (these feel slightly different to me, but each is treating [my relationship with a frustrating thing] like a small child who is a bit scared, who I want to help, and who I am generally more competent than but still want to respect the autonomy of).
7. Curiosity like “a competent but unemotional robot”, who just algorithmically notices “okay what are all the object level things going on here, when I ignore my usual abstractions?”… and then “okay, what are some questions that seem notable?” and “what are my beliefs about how I can interact with this thing?” and “what can I learn about this thing that’d be useful for my goals?”
Thanks! I’m also uninterested in the question of whether it’s possible. Obviously it is. The question is how we’ll decide to use it. I think that answer is critical to whether we’d consider the results utopian. So, does he consider how we should or will use that ability?
But another approach would be to create a mechanism that serves the same function as pain without being painful. Imagine an “exoskin”: a layer of nanotech sensors so thin that we can’t feel it or see it, but which monitors our skin surface for noxious stimuli. If we put our hand on a hot plate, … the mechanism contracts our muscle fibers so as to make our hand withdraw
I recommend Gwern’s discussion of pain to anyone who finds this sort of proposal intriguing (or anyone who is simply interested in the subject).
I would go further, and say that replacing human civilization with “a monoculture of healthy, happy, well-fed people living in peace and harmony” does in fact sound very bad. Never mind these aliens (who cares what they think?); from our perspective, this seems like a bad outcome. Not by any means the worst imaginable outcome… but still bad.
Every time you use an AI tool to write a regex to replace your ML classifier, you’re doing this.
If they’re merely opining, then why should we be appalled? Why would we even care? Let them opine to one another; it doesn’t affect us.
If they’re intervening (without our consent), then obviously this is a violation of our sovereignty and we should treat it as an act of war.
In any case, one “preserves” what one owns. These hypothetical advanced aliens are speaking as if they own us and our planet. This is obviously unacceptable as far as we’re concerned, and it would behoove us in this case to disabuse these aliens of such a notion at our earliest convenience.
Conversely, it makes perfect sense to speak of humans as collectively owning the natural resources of the Earth, including all the animals and so on. As such, wishing to preserve some aspects of it is entirely reasonable. (Whether we ultimately choose to do so is another question—but that it’s a question for us to answer, according to our preferences, is clear enough.)
What monster downvoted this?
Generally, hedgehogs are less trustworthy than foxes. If you see a debate as a choice between believing a mainstream hedgehog position and a contrarian hedgehog position, you are often not holding the most accurate view.
Instead of thinking that either Matthew Walker or Guzey is right, maybe the truth lies somewhere in the middle, and Guzey is pointing to real issues but exaggerating the effect.
I think most of the cases the OP lists are of that nature: there’s a real effect, and the contrarian hedgehog position exaggerates it.
Suggested discussion questions / ice breakers for today’s meetup, assembled from ACX posts in the past 6 months. See you all in one hour :-)
What was the most interesting question for you from the recent ACX survey?
What do you think of the Coffeepocalypse argument in relation to AI risk?
Do you agree with the Robin Hanson idea that (more) medicine doesn’t work?
Do you like the “Ye Olde Bay Area House Party” series of posts?
What do you think about the Lumina Probiotic? Are you planning to order it in the future?
What’s your position on the COVID lab leak debate?
Do you like prediction markets? What was a prediction you’ve made in the past year that you’re proud of?
What book would you review for the ACX book review contest if you were to write one?
Do you believe that capitalism is more effective than charity in solving world poverty?
Which dictator did you find the most interesting from the “Dictator Book Club” series?
It was added recently and just included in a new release, so
pip install transformer_lens
should work now/soon (you want v1.16.0, I think); otherwise you can install from the GitHub repo
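If you’re not sure which version you ended up with, this should show it (just a sanity check; pip treats the underscore and hyphen forms of the name as the same package):
pip show transformer_lens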
current LLMs vs dangerous AIs
Most current “alignment research” with LLMs seems indistinguishable from “capabilities research”. Both are just “getting the AI to be better at what we want it to do”, and there isn’t really a critical difference between the two.
Alignment in the original sense was defined in opposition to the AI’s own nefarious objectives, which LLMs don’t have; so alignment research with LLMs is probably moot.
something related I wrote in my MATS application:
-
I think the most important alignment failure modes occur when deploying an LLM as part of an agent (i.e. a program that autonomously runs a limited-context chain of thought from LLM predictions, maintains long-term storage, and calls functions such as search over storage, self-prompting, and habit modification, either based on LLM-generated function calls or as cron jobs/hooks).
-
These kinds of alignment failures are (1) only truly serious when the agent is somehow objective-driven or, equivalently, has feelings, which current LLMs have not been trained to be (I think that would need some kind of online learning, or learning to self-modify), and (2) only solvable once the agent is objective-driven.
-
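For concreteness, here is a minimal sketch of the kind of agent scaffolding described above; every name in it (call_llm, Memory, the SEARCH:/NOTE:/SELF-PROMPT: conventions) is a placeholder I’m making up for illustration, not an existing framework:

class Memory:
    """Toy long-term storage with naive keyword search."""
    def __init__(self):
        self.notes = []

    def store(self, text):
        self.notes.append(text)

    def search(self, query):
        return [n for n in self.notes if query.lower() in n.lower()]


def call_llm(prompt):
    # Stand-in for a real LLM call; a real agent would query a model here.
    return "FINAL: (no model attached in this sketch)"


def agent_loop(goal, memory, max_steps=20):
    prompt = goal
    for _ in range(max_steps):
        # Limited-context chain of thought: only the current prompt is passed in,
        # so anything the agent wants to remember must go through Memory.
        output = call_llm(prompt)
        if output.startswith("SEARCH:"):         # LLM-generated call into long-term storage
            hits = memory.search(output[len("SEARCH:"):].strip())
            prompt = goal + "\nSearch results: " + "; ".join(hits)
        elif output.startswith("NOTE:"):         # write to long-term storage
            memory.store(output[len("NOTE:"):].strip())
            prompt = goal
        elif output.startswith("SELF-PROMPT:"):  # self-prompting / habit modification
            prompt = output[len("SELF-PROMPT:"):].strip()
        else:
            return output                        # anything else is treated as a final answer
    return "(step limit reached)"


print(agent_loop("Summarize my notes on alignment.", Memory()))

Even this toy version shows where long-horizon behavior would come from: the storage and self-prompt steps persist state across otherwise limited-context LLM calls.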
He predicts that it will be possible to do things like engineer away sadness. He doesn’t devote much attention to convincing skeptics that such engineering will be possible. He seems more interested in questions of whether we should classify the results as utopian.
D&D.Sci forces the reader to think harder than anything else on this website
D&D.Sci smoothly entices me towards thinking hard. There’s lots of thinking hard that can be done when reading a good essay, but the default is always to read on (cf. Feynman on reading papers), and often I just do that while skipping the thinking hard.
I edited the top-comment to do that.
Here’s an event that would change my p(doom) substantially:
Someone comes up with an alignment method that looks like it would apply to superintelligent entities. They get extra points for trying it and finding that it works, and extra points for society coming up with a way to enforce that only entities that follow the method will be created.
So far none of the proposed alignment methods seem to stand up to a superintelligent AI that doesn’t want to obey them. They don’t even stand up to a few minutes of merely human thought. But it’s not obviously impossible, and lots of smart people are working on it.
In the non-doom case, I think one of the following will be the reason:
—Civilization ceases to progress, probably because of a disaster.
—The governments of the world ban AI progress.
—Superhuman AI turns out to be much harder than it looks, and not economically viable.
—The happy circumstance described above (a workable alignment method that gets enforced), giving us the marvelous benefits of superintelligence without the omnicidal drawbacks.
I applaud the effort. Big upvote for actually trying to solve the problem, by coming up with a way to create safe, aligned AGI. If only more people were doing this instead of hand wringing, arguing, or “working on the problem” in poorly-thought-out, too-indirect-to-probably-help-in-time ways. Good job going straight for the throat.
That said: It seems to me like the problem isn’t maximization or even optimization; it’s conflicting goals.
If I have a goal to make some paperclips, not as many as I can, just a few trillion, I may still enter a deadly conflict with humanity. If humanity knows about me and my paperclip goal, they’ll shut me down. The most certain way to get those paperclips made may be to eliminate unpredictable humanity’s ability to mess with my plans.
For essentially this reason, I think quantilization is, and was recognized as, a dead end. You don’t have to take your goals to the logical extreme to still take them way too far for humanity’s good.
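(For readers who haven’t run into the term: quantilization, roughly, means sampling an action from the top q-fraction of some base distribution instead of taking the single utility-maximizing action. A toy sketch, with all names and numbers my own illustration:)

import random

def quantilize(actions, base_weights, utility, q=0.1):
    # Toy q-quantilizer: rather than argmax(utility), sample from the base
    # distribution restricted to the top q fraction of its probability mass,
    # with mass ordered by utility (best actions first).
    ranked = sorted(zip(actions, base_weights),
                    key=lambda pair: utility(pair[0]), reverse=True)
    total = sum(base_weights)
    kept, mass = [], 0.0
    for action, weight in ranked:
        kept.append((action, weight))
        mass += weight
        if mass >= q * total:
            break
    choices, weights = zip(*kept)
    return random.choices(choices, weights=weights, k=1)[0]

# Even a mild q still samples from the highest-utility sliver of the base
# distribution, so whatever sits there (including drastic actions) can come up.
actions = ["make a few clips", "make a few trillion clips", "disable oversight first"]
utility = {"make a few clips": 1, "make a few trillion clips": 10, "disable oversight first": 12}.get
print(quantilize(actions, [0.7, 0.25, 0.05], utility, q=0.1))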
I read this post, but not the remainder yet, so you might’ve addressed this elsewhere.
I’m glad you liked it!
(. . . could you spoiler your strategy and win chance? I know this challenge is three years old, and what you mention here isn’t game-breaking info, but I want to keep it possible for people looking/playing through the archives to seek clarifications in the comments without unwittingly/unwillingly learning anything else about the scenario.)
+1 to Neel. We just fixed a release bug and now
pip install transformer-lens
should install 1.16.0 (worked in a colab for me)
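A quick way to confirm the install actually works is to load a small model; this is just the standard TransformerLens loading pattern, and “gpt2” is only an example (it downloads weights on first run):

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # smoke test
print(model.cfg.n_layers)  # e.g. 12 for gpt2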